<html> <head> <meta content="text/html; charset=windows-1252" http-equiv="Content-Type"> </head> <body bgcolor="#FFFFFF" text="#000000"> On 4/27/2015 1:28 PM, Vasil Valchev wrote:<br> <blockquote cite="mid:CAFZxf=L6cbpiC-pw1cKLD=PQ5iraOmR2wq1JCgj9HwaZFtoYxQ@mail.gmail.com" type="cite"> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> <div dir="ltr"> <div>Hi,</div> <div><br> </div> I would advise you to use a quorum disk _only_ as a last resort - it's better to first get a solid understanding of the clustering solution before adding additional complexity. <div>You can find an amazingly thorough and well-described tutorial here: <a moz-do-not-send="true" href="https://alteeve.ca/w/AN%21Cluster_Tutorial_2">https://alteeve.ca/w/AN!Cluster_Tutorial_2</a></div> </div> </blockquote> [Jatin] Thank you very much for sharing this tutorial. I will certainly go through it to gain a better understanding.<br> <blockquote cite="mid:CAFZxf=L6cbpiC-pw1cKLD=PQ5iraOmR2wq1JCgj9HwaZFtoYxQ@mail.gmail.com" type="cite"> <div dir="ltr"> <div><br> </div> <div>Especially useful are the first chapters - the theory.</div> <div>What I suspect is happening in your case is that your cluster communication and fencing are over the same network, which is not fault tolerant.</div> </div> </blockquote> [Jatin] <br> My cluster communication happens over one network while fencing happens over another. I use two separate VLANs for this purpose. When cluster communication fails due to a network outage, fencing happens over the other VLAN and both nodes get fenced.<br> <blockquote cite="mid:CAFZxf=L6cbpiC-pw1cKLD=PQ5iraOmR2wq1JCgj9HwaZFtoYxQ@mail.gmail.com" type="cite"> <div dir="ltr"> <div>So what happens if this network fails? 
Your 2 nodes can't see each other, so they send fence requests, but the fence devices are unreachable too, so those requests fail.</div> <div>They are retried a few times, I think, but if all fail, the fence agent returns failed and your cluster is stuck in a "recovering" or stopped state.</div> <div>Other times the network outage is shorter and the fence succeeds, resulting in both nodes going down - this is solved with the delay parameter.</div> <div>The first issue is an architectural one: it is the expected behavior of the cluster to stop (or "freeze") all resources if it can't guarantee the state of all members.</div> <div><br> </div> <div>Read the article above; it's really very useful.</div> <div><br> </div> <div>Cheers!<br> </div> </div> </blockquote> <br> Thanks<br> Jatin<br> <br> </body> </html>