[Linux-cluster] Working of a two-node cluster

Mon Apr 27 07:58:55 UTC 2015

Hi,

I would advise you to use quorum disk _only_ as a last resort - it's better
to first get a solid understanding of the clustering solution before adding
additional complexity.
You can find an amazingly thorough and well-written tutorial here:
https://alteeve.ca/w/AN!Cluster_Tutorial_2

The first chapters, which cover the theory, are especially useful.
What I suspect is happening in your case is that your cluster communication
and fencing run over the same network, which is not fault tolerant.
So what happens if this network fails? Your two nodes can't see each other,
so they send fence requests, but the fence devices are unreachable too, so
those requests fail.
I think they are retried a few times, but if all attempts fail, the fence
agent returns a failure and your cluster is stuck in a "recovering" or
stopped state.
Other times the network outage is shorter and the fencing succeeds on both
sides, resulting in both nodes going down - this second case is what the
delay parameter solves, by giving one node a head start.
The first issue is an architectural one: it is the expected behavior of the
cluster to stop (or "freeze") all resources if it can't guarantee the state
of all members.
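
To make the delay idea concrete, here is a rough sketch based on the
cluster.conf quoted further down in this thread. The value of 15 seconds and
the choice of node-103 as the node that should survive are my assumptions,
not something taken from your setup. The delay attribute goes on the device
reference inside the fence method of the node you want to protect:

```xml
<clusternode name="node-103" nodeid="1">
        <fence>
                <method name="Method01">
                        <!-- Assumed value: any request to fence node-103
                             waits 15 seconds before it is carried out. -->
                        <device name="node-103" delay="15"/>
                </method>
        </fence>
</clusternode>
```

With something like this, when both nodes try to fence each other at the same
moment, node-105 has to wait while node-103 can act immediately, so only
node-105 should go down. The entry for node-105 stays without a delay, and
config_version has to be incremented whenever the file is changed.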

Read the article above; it's really very useful.

Cheers!

On Mon, Apr 27, 2015 at 9:44 AM, Vijay Kakkar <vijaykakkars at gmail.com>
wrote:

> You should look into qdisk now. I hope this will be helpful.
>
> On Mon, Apr 27, 2015 at 11:38 AM, Jatin Davey <jashokda at cisco.com> wrote:
>
>> Yes, I did restart it.
>>
>>
>> On 4/27/2015 11:31 AM, emmanuel segura wrote:
>>
>> Did you restart the cluster after adding the delay parameter?
>>
>> 2015-04-27 7:49 GMT+02:00 Jatin Davey <jashokda at cisco.com>:
>>
>> OK, I tried with delay but it has not helped. I guess I have to try using
>> quorum disk now.
>>
>> Thanks
>> Jatin
>>
>> On 4/24/2015 7:06 PM, Vijay Kakkar wrote:
>>
>> You may need to delay the fencing ( delay=seconds ) or use quorum disk if
>> delaying the fencing doesn't help.
>>
>> On Fri, Apr 24, 2015 at 6:23 PM, Jatin Davey <jashokda at cisco.com> wrote:
>>
>>  Here is my cluster.conf file
>>
>> ************************
>> <?xml version="1.0"?>
>> <cluster config_version="4" name="****">
>>         <clusternodes>
>>                 <clusternode name="node-103" nodeid="1">
>>                         <fence>
>>                                 <method name="Method01">
>>                                         <device name="node-103"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>                 <clusternode name="node-105" nodeid="2">
>>                         <fence>
>>                                 <method name="Method02">
>>                                         <device name="node-105"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>         </clusternodes>
>>         <cman expected_votes="1" two_node="1"/>
>>         <fencedevices>
>>                 <fencedevice agent="fence_ipmilan" auth="password"
>> ipaddr="x.x.x.x" lanplus="on" login="admin" name="node-103" passwd="*****"
>> privlvl="ADMINISTRATOR"/>
>>                 <fencedevice agent="fence_ipmilan" auth="password"
>> ipaddr="x.x.x.x" lanplus="on" login="admin" name="node-105" passwd="******"
>> privlvl="ADMINISTRATOR"/>
>>         </fencedevices>
>>         <fence_daemon post_join_delay="120"/>
>>         <rm>
>>                 <resources>
>>                         <netfs export="/test" force_unmount="1"
>> fstype="nfs" host="x.x.x.x" mountpoint="/test/test/test" name="test123"/>
>>                         <ip address="x.x.x.x" sleeptime="5"/>
>>                         <script file="/xxx/xxx/xxx/xxx/xx.sh"
>> name="xxxx"/>
>>                 </resources>
>>                 <failoverdomains>
>>                         <failoverdomain name="Failover01" nofailback="1"
>> ordered="1">
>>                                 <failoverdomainnode name="node-103"
>> priority="1"/>
>>                                 <failoverdomainnode name="node-105"
>> priority="2"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>                 <service domain="Failover01" name="Service01"
>> recovery="relocate">
>>                         <ip ref="x.x.x.x"/>
>>                         <netfs ref="test123"/>
>>                         <script ref="xxxx"/>
>>                 </service>
>>         </rm>
>> </cluster>
>>
>>
>> On 4/24/2015 6:01 PM, emmanuel segura wrote:
>>
>> Please share your cluster config; that way someone may be able to help you.
>>
>> 2015-04-24 14:12 GMT+02:00 Jatin Davey <jashokda at cisco.com>:
>>
>>  Hi
>>
>> I am using a two-node cluster on RHEL 6.5. I have a very fundamental
>> question.
>>
>> For the two-node cluster to work, is it mandatory that both nodes are
>> "online" and communicating with each other?
>>
>> What I can see is that if there is a communication failure between them,
>> then either both nodes are fenced or the cluster gets into a "stopped"
>> state (seen from the output of the clustat command).
>>
>> Apologies if my questions are naive. I am just starting to work with the
>> RHEL cluster add-on.
>>
>> Thanks
>> Jatin
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>  --
>> Cheers
>>
>> Vijay Kakkar - RHC{E,SS,VA,DS,A,I,X}
>>
>> Techgrills Systems Pvt. Ltd.
>> 011-46521313 | +919999103657
>> http://www.techgrills.com
>> http://lnkd.in/bnj2VUU
>>
>
>
>
> --
> Cheers
>
> *Vijay Kakkar - RHC{E,SS,VA,DS,A,I,X}*
>
> Techgrills Systems Pvt. Ltd.
> 011-46521313 | +919999103657
> http://www.techgrills.com
> http://lnkd.in/bnj2VUU
>