[Linux-cluster] Not restarting "max_restart" times before relocating failed service

emmanuel segura emi2fast at gmail.com
Wed Oct 31 09:23:20 UTC 2012


Hello

Maybe you are missing recovery="restart" in your services.
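
For example, a service stanza along these lines (just an untested sketch; whether max_restarts and restart_expire_time are honored on <service> depends on your rgmanager version, and the 600-second window is only an example value):

        <service autostart="0" domain="mydomain" exclusive="0" name="mgmt"
                 recovery="restart" max_restarts="5" restart_expire_time="600">
                <script ref="myHaAgent"/>
                <ip ref="192.168.51.51"/>
        </service>

If I remember right, you can also run "rg_test test /etc/cluster/cluster.conf" to check how rgmanager parses the resource tree.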

2012/10/31 Parvez Shaikh <parvez.h.shaikh at gmail.com>

> Hi Digimer,
>
> cman_tool version gives the following -
>
> 6.2.0 config 22
>
> Cluster.conf -
>
> <?xml version="1.0"?>
> <cluster alias="PARVEZ" config_version="22" name="PARVEZ">
>         <clusternodes>
>                 <clusternode name="myblade2" nodeid="2" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device blade="2" missing_as_off="1" name="BladeCenterFencing-1"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="myblade1" nodeid="1" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device blade="1" missing_as_off="1" name="BladeCenterFencing-1"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <fencedevices>
>                 <fencedevice agent="fence_bladecenter" ipaddr="mm-1.mydomain.com" login="XXXX" name="BladeCenterFencing-1" passwd="XXXXX" shell_timeout="10"/>
>         </fencedevices>
>         <rm>
>                 <resources>
>                         <script file="/localhome/my/my_ha" name="myHaAgent"/>
>                         <ip address="192.168.51.51" monitor_link="1"/>
>                 </resources>
>                 <failoverdomains>
>                         <failoverdomain name="mydomain" nofailback="1" ordered="1" restricted="1">
>                                 <failoverdomainnode name="myblade2" priority="2"/>
>                                 <failoverdomainnode name="myblade1" priority="1"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <service autostart="0" domain="mydomain" exclusive="0" max_restarts="5" name="mgmt" recovery="restart">
>                         <script ref="myHaAgent"/>
>                         <ip ref="192.168.51.51"/>
>                 </service>
>         </rm>
>         <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="0"/>
> </cluster>
>
> Thanks,
> Parvez
>
> On Tue, Oct 30, 2012 at 9:25 PM, Digimer <lists at alteeve.ca> wrote:
>
>> On 10/30/2012 01:54 AM, Parvez Shaikh wrote:
>> > Hi experts,
>> >
>> > I have defined a service as follows in cluster.conf -
>> >
>> >                 <service autostart="0" domain="mydomain" exclusive="0" max_restarts="5" name="mgmt" recovery="restart">
>> >                         <script ref="myHaAgent"/>
>> >                         <ip ref="192.168.51.51"/>
>> >                 </service>
>> >
>> > I mentioned max_restarts=5 hoping that if the cluster fails to start the
>> > service 5 times, it will then relocate it to another cluster node in the
>> > failover domain.
>> >
>> > To check this, I brought down the NIC hosting the service's floating IP
>> > and got the following logs -
>> >
>> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> Link for eth1: Not detected
>> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
>> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
>> > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> status on ip "192.168.51.51" returned 1 (generic error)
>> > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
>> > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering*
>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Recovering failed service service:mgmt
>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> start on ip "192.168.51.51" returned 1 (generic error)
>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #68: Failed to start service:mgmt; return value: 1
>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
>> > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering
>> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #71: Relocating failed service service:mgmt*
>> > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
>> > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
>> >
>> > But from the log it appears that the cluster tried to restart the service
>> > only ONCE before relocating.
>> >
>> > I was expecting the cluster to retry starting this service five times on
>> > the same node before relocating.
>> >
>> > Can anybody correct my understanding?
>> >
>> > Thanks,
>> > Parvez
>>
>> What version? Please paste your full cluster.conf.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
this is my life and I live it for as long as God wills