[Linux-cluster] Not restarting "max_restart" times before relocating failed service

Parvez Shaikh parvez.h.shaikh at gmail.com
Wed Oct 31 04:48:18 UTC 2012


Digimer,

Output of 'rpm -q cman':

cman-2.0.115-34.el5

There is no "http" in the fencedevice entry itself; I think my email
client is inserting it.


Thanks,
Parvez
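
On the max_restarts question itself, for reference, this is the variant of
the service block I would try next. This is a sketch only: I am assuming
the rgmanager shipped with this release honors restart_expire_time, the
companion attribute that defines the window over which the max_restarts
counter is accumulated, and the 600-second value is just an illustrative
choice.

```xml
<!-- Hypothetical variant of the existing service block.
     restart_expire_time (in seconds) is assumed here to be the window
     over which max_restarts is counted; support for both attributes
     depends on the rgmanager version in use. -->
<service autostart="0" domain="mydomain" exclusive="0"
         max_restarts="5" restart_expire_time="600"
         name="mgmt" recovery="restart">
        <script ref="myHaAgent"/>
        <ip ref="192.168.51.51"/>
</service>
```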

On Wed, Oct 31, 2012 at 10:14 AM, Digimer <lists at alteeve.ca> wrote:

> What does 'rpm -q cman' return?
>
> This looks very odd;
> <fencedevice agent="fence_bladecenter"
>     ipaddr="mm-1.mydomain.com <http://mm-1.mydomain.com>"
>
> Please remove this for now;
>
> <fence_daemon clean_start="1" post_fail_delay="0"
>     post_join_delay="0"/>
>
> In general, you don't want to assume a clean start. It's asking for
> trouble. The default delays are also sane. You can always come back to
> this later after this issue is resolved, if you wish.
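>
> To be concrete, once that line is removed the daemon falls back to its
> built-in defaults, which on this generation of fenced are roughly
> equivalent to the following (the values are my recollection, not
> verified against your package; please check `man fenced` for your
> exact version):
>
> ```xml
> <!-- Sketch of the assumed effective defaults: no clean_start
>      assumption, post_fail_delay 0, post_join_delay 3.
>      Values are assumptions; verify with man fenced. -->
> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> ```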
>
> On 10/30/2012 09:20 PM, Parvez Shaikh wrote:
> > Hi Digimer,
> >
> > cman_tool version gives the following:
> >
> > 6.2.0 config 22
> >
> > cluster.conf:
> >
> > <?xml version="1.0"?>
> > <cluster alias="PARVEZ" config_version="22" name="PARVEZ">
> >         <clusternodes>
> >                 <clusternode name="myblade2" nodeid="2" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device blade="2" missing_as_off="1" name="BladeCenterFencing-1"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> >                 <clusternode name="myblade1" nodeid="1" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device blade="1" missing_as_off="1" name="BladeCenterFencing-1"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> >         </clusternodes>
> >         <cman expected_votes="1" two_node="1"/>
> >         <fencedevices>
> >                 <fencedevice agent="fence_bladecenter" ipaddr="mm-1.mydomain.com" login="XXXX" name="BladeCenterFencing-1" passwd="XXXXX" shell_timeout="10"/>
> >         </fencedevices>
> >         <rm>
> >                 <resources>
> >                         <script file="/localhome/my/my_ha" name="myHaAgent"/>
> >                         <ip address="192.168.51.51" monitor_link="1"/>
> >                 </resources>
> >                 <failoverdomains>
> >                         <failoverdomain name="mydomain" nofailback="1" ordered="1" restricted="1">
> >                                 <failoverdomainnode name="myblade2" priority="2"/>
> >                                 <failoverdomainnode name="myblade1" priority="1"/>
> >                         </failoverdomain>
> >                 </failoverdomains>
> >                 <service autostart="0" domain="mydomain" exclusive="0" max_restarts="5" name="mgmt" recovery="restart">
> >                         <script ref="myHaAgent"/>
> >                         <ip ref="192.168.51.51"/>
> >                 </service>
> >         </rm>
> >         <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="0"/>
> > </cluster>
> >
> > Thanks,
> > Parvez
> >
> > On Tue, Oct 30, 2012 at 9:25 PM, Digimer <lists at alteeve.ca
> > <mailto:lists at alteeve.ca>> wrote:
> >
> >     On 10/30/2012 01:54 AM, Parvez Shaikh wrote:
> >     > Hi experts,
> >     >
> >     > I have defined a service as follows in cluster.conf -
> >     >
> >     >                 <service autostart="0" domain="mydomain" exclusive="0"
> >     >                         max_restarts="5" name="mgmt" recovery="restart">
> >     >                         <script ref="myHaAgent"/>
> >     >                         <ip ref="192.168.51.51"/>
> >     >                 </service>
> >     >
> >     > I set max_restarts="5" hoping that if the cluster failed to start
> >     > the service 5 times, it would then relocate it to another cluster
> >     > node in the failover domain.
> >     >
> >     > To check this, I brought down the NIC hosting the service's
> >     > floating IP and got the following logs:
> >     >
> >     > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> Link for eth1: Not detected
> >     > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
> >     > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
> >     > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> status on ip "192.168.51.51" returned 1 (generic error)
> >     > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
> >     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering
> >     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Recovering failed service service:mgmt
> >     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> start on ip "192.168.51.51" returned 1 (generic error)
> >     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #68: Failed to start service:mgmt; return value: 1
> >     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
> >     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering
> >     > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #71: Relocating failed service service:mgmt
> >     > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
> >     > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
> >     >
> >     > But from the log it appears that the cluster tried to restart the
> >     > service only ONCE before relocating.
> >     >
> >     > I was expecting the cluster to retry starting this service five
> >     > times on the same node before relocating it.
> >     >
> >     > Can anybody correct my understanding?
> >     >
> >     > Thanks,
> >     > Parvez
> >
> >     What version? Please paste your full cluster.conf.
> >
> >     --
> >     Digimer
> >     Papers and Projects: https://alteeve.ca/w/
> >     What if the cure for cancer is trapped in the mind of a person
> without
> >     access to education?
> >
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>