[Linux-cluster] Not restarting "max_restart" times before relocating failed service

Parvez Shaikh parvez.h.shaikh at gmail.com
Wed Oct 31 17:35:05 UTC 2012


Hi,

I am using recovery="restart", as is evident from the cluster.conf attached earlier.

Thanks,
Parvez
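P.S. As a sketch only (not my running config): I understand that on some rgmanager releases max_restarts is meant to be paired with restart_expire_time; the value below is illustrative, and attribute support depends on the release:

```xml
<!-- Sketch, not a verified config: max_restarts paired with
     restart_expire_time (seconds); support for these attributes
     depends on the rgmanager release. -->
<service autostart="0" domain="mydomain" exclusive="0"
         max_restarts="5" restart_expire_time="300"
         name="mgmt" recovery="restart">
        <script ref="myHaAgent"/>
        <ip ref="192.168.51.51"/>
</service>
```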

On Wed, Oct 31, 2012 at 2:53 PM, emmanuel segura <emi2fast at gmail.com> wrote:

> Hello
>
> Maybe you are missing recovery="restart" in your services
>
> 2012/10/31 Parvez Shaikh <parvez.h.shaikh at gmail.com>
>
>> Hi Digimer,
>>
>> cman_tool version gives the following:
>>
>> 6.2.0 config 22
>>
>> Cluster.conf -
>>
>> <?xml version="1.0"?>
>> <cluster alias="PARVEZ" config_version="22" name="PARVEZ">
>>         <clusternodes>
>>                 <clusternode name="myblade2" nodeid="2" votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device blade="2"
>> missing_as_off="1" name="BladeCenterFencing-1"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>                 <clusternode name="myblade1" nodeid="1" votes="1">
>>                         <fence>
>>                                 <method name="1">
>>                                         <device blade="1"
>> missing_as_off="1" name="BladeCenterFencing-1"/>
>>                                 </method>
>>                         </fence>
>>                 </clusternode>
>>         </clusternodes>
>>         <cman expected_votes="1" two_node="1"/>
>>         <fencedevices>
>>                 <fencedevice agent="fence_bladecenter"
>> ipaddr="mm-1.mydomain.com" login="XXXX" name="BladeCenterFencing-1"
>> passwd="XXXXX" shell_timeout="10"/>
>>         </fencedevices>
>>         <rm>
>>                 <resources>
>>                         <script file="/localhome/my/my_ha"
>> name="myHaAgent"/>
>>                         <ip address="192.168.51.51" monitor_link="1"/>
>>                 </resources>
>>                 <failoverdomains>
>>                         <failoverdomain name="mydomain" nofailback="1"
>> ordered="1" restricted="1">
>>                                 <failoverdomainnode name="myblade2"
>> priority="2"/>
>>                                 <failoverdomainnode name="myblade1"
>> priority="1"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>                 <service autostart="0" domain="mydomain" exclusive="0"
>> max_restarts="5" name="mgmt" recovery="restart">
>>                         <script ref="myHaAgent"/>
>>                         <ip ref="192.168.51.51"/>
>>                 </service>
>>         </rm>
>>         <fence_daemon clean_start="1" post_fail_delay="0"
>> post_join_delay="0"/>
>> </cluster>
>>
>> Thanks,
>> Parvez
>>
>> On Tue, Oct 30, 2012 at 9:25 PM, Digimer <lists at alteeve.ca> wrote:
>>
>>> On 10/30/2012 01:54 AM, Parvez Shaikh wrote:
>>> > Hi experts,
>>> >
>>> > I have defined a service as follows in cluster.conf -
>>> >
>>> >                 <service autostart="0" domain="mydomain" exclusive="0"
>>> > max_restarts="5" name="mgmt" recovery="restart">
>>> >                         <script ref="myHaAgent"/>
>>> >                         <ip ref="192.168.51.51"/>
>>> >                 </service>
>>> >
>>> > I set max_restarts="5" expecting that if the cluster failed to start
>>> > the service five times, it would then relocate it to another cluster
>>> > node in the failover domain.
>>> >
>>> > To check this, I brought down the NIC hosting the service's floating
>>> > IP and got the following logs:
>>> >
>>> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> Link for eth1: Not
>>> > detected
>>> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
>>> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
>>> > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> status on ip
>>> > "192.168.51.51" returned 1 (generic error)
>>> > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> Stopping service
>>> > service:mgmt
>>> > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt
>>> > is recovering*
>>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Recovering failed
>>> > service service:mgmt
>>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> start on ip
>>> > "192.168.51.51" returned 1 (generic error)
>>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #68: Failed to start
>>> > service:mgmt; return value: 1
>>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Stopping service
>>> > service:mgmt
>>> > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt
>>> > is recovering
>>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #71: Relocating failed
>>> > service service:mgmt*
>>> > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is
>>> > stopped
>>> > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is
>>> > stopped
>>> >
>>> > But from the log it appears that the cluster tried to restart the
>>> > service only ONCE before relocating.
>>> >
>>> > I was expecting the cluster to retry starting this service five times
>>> > on the same node before relocating it.
>>> >
>>> > Can anybody correct my understanding?
>>> >
>>> > Thanks,
>>> > Parvez
>>>
>>> What version? Please paste your full cluster.conf.
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person without
>>> access to education?
>>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> this is my life and I live it as long as God wills
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20121031/35d24531/attachment.htm>


More information about the Linux-cluster mailing list