Hi Ofer Inbar,<div>When a cluster service fails over to another node, it is sometimes still in recovery mode after a while, and the cluster then marks the service as failed again. What is the default time the cluster waits for a service to recover completely? Can this wait time be increased, and if so, which configuration setting extends it? Any suggestions would be very helpful.</div> <div><br></div><div>I am facing this problem myself: the cluster waits for around 15 minutes, and if the service has not recovered properly by then, the cluster kills it and shows it as failed. My workaround is to manually stop the cluster services on all nodes, start the service standalone so that everything recovers, and put it back under cluster control once it has started cleanly.</div> <div><br></div><div>Thanks in advance,</div><div><br></div><div>BSK.<br><br><div class="gmail_quote">On Tue, Aug 9, 2011 at 9:30 PM, <span dir="ltr"><<a href="mailto:linux-cluster-request@redhat.com">linux-cluster-request@redhat.com</a>></span> wrote:<br> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Message: 1<br> Date: Mon, 8 Aug 2011 18:14:25 -0400<br> From: Ofer Inbar <<a href="mailto:cos@aaaaa.org">cos@aaaaa.org</a>><br> To: linux clustering <<a href="mailto:linux-cluster@redhat.com">linux-cluster@redhat.com</a>><br> Subject: Re: [Linux-cluster] Expected behaviour when service fails to<br> stop<br> Message-ID: <<a href="mailto:20110808221425.GZ341@mip.aaaaa.org">20110808221425.GZ341@mip.aaaaa.org</a>><br> Content-Type: text/plain; charset=us-ascii<br> <br> Chris Alexander <<a href="mailto:chris.alexander@kusiri.com">chris.alexander@kusiri.com</a>> wrote:<br> > I was wondering what the expected behaviour of the cluster would be when a<br> > service cannot be shutdown safely. For example, if you request a service<br> > group to be relocated to another node in the cluster, if one of the services<br> > in that group fails to stop (causing a timeout?), what would the result be?<br> > I should imagine that the service would be marked as Failed, is this the<br> > case? I have been unable to find this particular scenario documented anywhere.<br> <br> This may be the documentation you're looking for:<br> <a href="https://fedorahosted.org/cluster/wiki/ServiceOperationalBehaviors" target="_blank">https://fedorahosted.org/cluster/wiki/ServiceOperationalBehaviors</a><br> <br> Under "Service States", the "failed" state is documented as:<br> failed - The service is presumed dead. This state occurs whenever a<br> resource's stop operation fails. 
Administrator must verify that there<br> are no allocated resources (mounted file systems, etc.) prior to<br> issuing a disable request. The only action which can take place from<br> this state is disable.<br> <br> So your intuition that the service is marked as "failed" if the stop<br> fails, is correct. However, I'm not sure what you mean by "causing a<br> timeout". What defines a stop failure is up to the resource agent<br> script (located in /usr/share/cluster) corresponding to the resource<br> it's trying to stop. If the "stop" operation from that script returns<br> a non-zero exit code, then the stop is considered to have failed.<br> -- Cos<br> <br> <br> <br> ------------------------------<br><br></blockquote></div> </div>
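<div>The exit-code rule in the quoted reply (a non-zero exit from the agent's "stop" operation means the stop failed, and the service is then marked "failed") can be sketched roughly as below. This is only an illustration, not rgmanager's actual code: the function names <code>state_after_stop</code> and <code>run_stop</code> are hypothetical, the "stopped" label for a clean stop is an assumption, and the demo uses stand-in commands rather than a real agent from /usr/share/cluster, which would normally be invoked by the cluster with its own environment.</div>

```python
import subprocess
import sys
from typing import Sequence


def state_after_stop(exit_code: int) -> str:
    """Map the exit code of a "stop" operation to a service state.

    Per the quoted reply, any non-zero exit code means the stop failed,
    which puts the service into the "failed" state. The "stopped" label
    for a clean exit is an assumption made for this sketch.
    """
    return "stopped" if exit_code == 0 else "failed"


def run_stop(agent_cmd: Sequence[str]) -> str:
    """Run a (hypothetical) agent command with "stop" appended as the
    last argument and return the state implied by its exit code."""
    result = subprocess.run(list(agent_cmd) + ["stop"], check=False)
    return state_after_stop(result.returncode)


if __name__ == "__main__":
    # Stand-in "agents": tiny Python one-liners that exit 0 or non-zero.
    ok_agent = [sys.executable, "-c", "pass"]
    bad_agent = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_stop(ok_agent))
    print(run_stop(bad_agent))
```

<div>Note that this sketch models only the exit code, which is what the quoted reply describes; it deliberately says nothing about timeouts.</div>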