[Linux-cluster] Not restarting "max_restart" times before relocating failed service

Digimer lists at alteeve.ca
Wed Oct 31 04:44:45 UTC 2012


What does 'rpm -q cman' return?
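The exact build matters here; rgmanager's handling of 'max_restarts' has
changed between releases, so the package version tells us what behaviour
you can actually expect.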

This looks very odd;

  <fencedevice agent="fence_bladecenter"
    ipaddr="mm-1.mydomain.com <http://mm-1.mydomain.com>"
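
If that '<http://...>' fragment is really in the file, and not just your
mail client auto-linking the hostname, fence_bladecenter won't be able to
reach the management module. The attribute should carry the bare hostname
only; something like:

  <fencedevice agent="fence_bladecenter" ipaddr="mm-1.mydomain.com"
    login="XXXX" name="BladeCenterFencing-1" passwd="XXXXX"
    shell_timeout="10"/>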

Please remove this for now;

  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="0"/>

In general, you don't want to assume a clean start. It's asking for
trouble. The default delays are also sane. You can always come back to
this later after this issue is resolved, if you wish.
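
For what it's worth, if you delete the element entirely you simply get the
defaults (clean_start="0", post_fail_delay="0" and, if memory serves,
post_join_delay="6"), which is what you want while debugging.

As for the question itself; if I'm reading rgmanager's recovery logic
right, recovery="restart" means "restart in place, and if that restart
itself *fails*, relocate". max_restarts="5" does not make rgmanager retry
a failing start five times; it caps how many (otherwise successful)
restarts are tolerated before escalating to a relocate, and it is normally
paired with restart_expire_time so the restart counter resets over time.
A sketch of what that pairing looks like (restart_expire_time="600" is
just an example window, not something from your config):

  <service autostart="0" domain="mydomain" exclusive="0"
    max_restarts="5" restart_expire_time="600"
    name="mgmt" recovery="restart">
    <script ref="myHaAgent"/>
    <ip ref="192.168.51.51"/>
  </service>

In your test the NIC was down, so the start phase of the restart failed
("start on ip ... returned 1") and the service was relocated on the first
attempt, which is exactly what your log shows. You can see which
attributes your installed rgmanager actually understands, without touching
the live cluster, with:

  rg_test rules

and run your config through the same parser with:

  rg_test test /etc/cluster/cluster.conf

(rg_test ships with the rgmanager package.)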

On 10/30/2012 09:20 PM, Parvez Shaikh wrote:
> Hi Digimer,
> 
> 'cman_tool version' gives the following -
> 
> 6.2.0 config 22
> 
> Cluster.conf -
> 
> <?xml version="1.0"?>
> <cluster alias="PARVEZ" config_version="22" name="PARVEZ">
>     <clusternodes>
>         <clusternode name="myblade2" nodeid="2" votes="1">
>             <fence>
>                 <method name="1">
>                     <device blade="2" missing_as_off="1"
>                         name="BladeCenterFencing-1"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="myblade1" nodeid="1" votes="1">
>             <fence>
>                 <method name="1">
>                     <device blade="1" missing_as_off="1"
>                         name="BladeCenterFencing-1"/>
>                 </method>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <cman expected_votes="1" two_node="1"/>
>     <fencedevices>
>         <fencedevice agent="fence_bladecenter"
>             ipaddr="mm-1.mydomain.com <http://mm-1.mydomain.com>"
>             login="XXXX" name="BladeCenterFencing-1" passwd="XXXXX"
>             shell_timeout="10"/>
>     </fencedevices>
>     <rm>
>         <resources>
>             <script file="/localhome/my/my_ha" name="myHaAgent"/>
>             <ip address="192.168.51.51" monitor_link="1"/>
>         </resources>
>         <failoverdomains>
>             <failoverdomain name="mydomain" nofailback="1" ordered="1"
>                 restricted="1">
>                 <failoverdomainnode name="myblade2" priority="2"/>
>                 <failoverdomainnode name="myblade1" priority="1"/>
>             </failoverdomain>
>         </failoverdomains>
>         <service autostart="0" domain="mydomain" exclusive="0"
>             max_restarts="5" name="mgmt" recovery="restart">
>             <script ref="myHaAgent"/>
>             <ip ref="192.168.51.51"/>
>         </service>
>     </rm>
>     <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="0"/>
> </cluster>
> 
> Thanks,
> Parvez
> 
> On Tue, Oct 30, 2012 at 9:25 PM, Digimer <lists at alteeve.ca> wrote:
> 
>     On 10/30/2012 01:54 AM, Parvez Shaikh wrote:
>     > Hi experts,
>     >
>     > I have defined a service as follows in cluster.conf -
>     >
>     >     <service autostart="0" domain="mydomain" exclusive="0"
>     >         max_restarts="5" name="mgmt" recovery="restart">
>     >         <script ref="myHaAgent"/>
>     >         <ip ref="192.168.51.51"/>
>     >     </service>
>     >
>     > I set max_restarts="5" expecting that if the cluster failed to start
>     > the service 5 times, it would then relocate it to another cluster node
>     > in the failover domain.
>     >
>     > To check this, I took down the NIC hosting the service's floating IP
>     > and got the following logs -
>     >
>     > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> Link for eth1: Not detected
>     > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
>     > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
>     > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> status on ip "192.168.51.51" returned 1 (generic error)
>     > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
>     > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering*
>     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Recovering failed service service:mgmt
>     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> start on ip "192.168.51.51" returned 1 (generic error)
>     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #68: Failed to start service:mgmt; return value: 1
>     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
>     > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering
>     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #71: Relocating failed service service:mgmt*
>     > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
>     > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
>     >
>     > But from the log it appears that the cluster tried to restart the
>     > service only ONCE before relocating.
>     >
>     > I was expecting the cluster to retry starting this service five times
>     > on the same node before relocating.
>     >
>     > Can anybody correct my understanding?
>     >
>     > Thanks,
>     > Parvez
> 
>     What version? Please paste your full cluster.conf.
> 
>     --
>     Digimer
>     Papers and Projects: https://alteeve.ca/w/
>     What if the cure for cancer is trapped in the mind of a person without
>     access to education?
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



