[Linux-cluster] Re: clusvcadm -U returns "Temporary failure" on vm service

Aaron Benner tfrumbacher at gmail.com
Thu Nov 5 23:11:32 UTC 2009


I have dug a little deeper on this:

In rg_state.c, the switch in _svc_freeze( ... ) that checks the
service state only accepts RG_STATE_{STOPPED,STARTED,DISABLED} as
valid.

Based on the output of clustat, my services are in RG_STATE_MIGRATE.

That means execution falls through to the default case, which unlocks
the group and returns RG_EAGAIN, and that is what produces the
"Temporary failure; try again" message noted below.
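
For anyone following along, here is a minimal stand-alone model of
that check (my own illustration, not the actual rg_state.c source;
the svc_set_frozen() helper and the enum definitions are made up, and
only the RG_STATE_* / RG_EAGAIN names come from the code discussed
above):

#include <stdio.h>

/* Toy model of the state check in _svc_freeze(); not the real
 * rgmanager code.  The point is only that a service sitting in
 * RG_STATE_MIGRATE falls into the default case, so the caller gets
 * RG_EAGAIN back and the frozen flag is never touched. */
enum rg_state {
    RG_STATE_STOPPED,
    RG_STATE_STARTED,
    RG_STATE_DISABLED,
    RG_STATE_MIGRATE
};

enum rg_ret { RG_SUCCESS = 0, RG_EAGAIN = -1 };

static enum rg_ret svc_set_frozen(enum rg_state state, int *frozen, int freeze)
{
    switch (state) {
    case RG_STATE_STOPPED:
    case RG_STATE_STARTED:
    case RG_STATE_DISABLED:
        /* Only these states may have the frozen flag toggled. */
        *frozen = freeze;
        return RG_SUCCESS;
    default:
        /* RG_STATE_MIGRATE lands here: the group is unlocked and the
         * caller reports "Temporary failure; try again". */
        return RG_EAGAIN;
    }
}

int main(void)
{
    int frozen = 1;                 /* flag set before the migration */

    if (svc_set_frozen(RG_STATE_MIGRATE, &frozen, 0) == RG_EAGAIN)
        printf("Temporary failure; try again (still frozen=%d)\n", frozen);
    return 0;
}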

What this means is that, given the scenario outlined below, it is
possible to have a service in the "migrating" state with the frozen
flag set.  Once that state is entered the resource group can no
longer be unfrozen, because the unfreeze code expects it to
eventually undergo a state change, at which point you could unfreeze
it.  The problem is that, now that it's frozen, it can't be stopped,
disabled, etc., so I can't force a state change.
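
To make the dead end concrete, here is another small stand-alone
sketch (again an illustration with made-up names, not rgmanager
code): unfreezing is only accepted in the settled states, while stop
and disable requests are refused for a frozen service, so a frozen
service stuck in "migrating" has no admissible operation left.

#include <stdbool.h>
#include <stdio.h>

enum state { STOPPED, STARTED, DISABLED, MIGRATING };

struct service {
    enum state state;
    bool frozen;
};

/* Unfreeze is only accepted in the three settled states. */
static bool can_unfreeze(const struct service *s)
{
    return s->state == STOPPED || s->state == STARTED ||
           s->state == DISABLED;
}

/* Stop/disable requests are rejected while the frozen flag is set. */
static bool can_stop_or_disable(const struct service *s)
{
    return !s->frozen;
}

int main(void)
{
    struct service vm = { .state = MIGRATING, .frozen = true };

    printf("unfreeze allowed:     %s\n", can_unfreeze(&vm) ? "yes" : "no");
    printf("stop/disable allowed: %s\n", can_stop_or_disable(&vm) ? "yes" : "no");
    /* Both print "no": nothing can change the state, and nothing can
     * clear the flag, until rgmanager itself leaves "migrating". */
    return 0;
}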

I saw a reference to a patch that prevents migration of frozen
groups, but either I'm not running a release with that code or it
doesn't apply to the situation I outlined below.

--AB

On Nov 3, 2009, at 1:03 PM, Aaron Benner wrote:

> All,
>
> I have a problem that I can't find documentation on and that has me
> baffled.
>
> I have a 3-node cluster running Xen with multiple domUs enabled as
> cluster services.  The individual services are set to have node
> affinity using resource groups (see cluster.conf below), and live
> migration is enabled.
>
> I had migrated two domUs off of one of the cluster nodes in
> anticipation of a power-cycle and network reconfig.  Before bringing
> up the node that had been reconfigured, I froze (clusvcadm -Z ...)
> the domUs in question so that when the newly reconfigured node came
> up they would not migrate back to their preferred host, or at least
> that's what I *THOUGHT* -Z would do.
>
> I booted up the reconfigured node, and, ignoring their frozen state,
> the rgmanager on the rebooting node initiated a migration of the
> domUs.  The migration finished and the virtuals resumed operation on
> the reconfigured host.  The problem is that rgmanager now shows those
> resource groups as having state "migrating" (even though there are
> no migration processes still active), and clusvcadm -U ... returns
> the following:
>
> "Local machine unfreezing vm:SaturnE...Temporary failure; try again"
>
> I get this message on all of the cluster nodes.  I'm not sure if  
> this is coming from clusvcadm, vm.sh, or some other piece of the  
> cluster puzzle.  Is there any way to get rgmanager to realize that  
> these resource groups are no longer migrating and as such can be  
> unfrozen?  Is that even my problem?  Can I fix this with anything  
> other than a complete power down of the cluster (disaster)?
>
> --AB
> <?xml version="1.0"?>
> <cluster alias="plieadies" config_version="66" name="plieadies">
>         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="180"/>
>         <clusternodes>
>                 <clusternode name="plieadies3.atmexpress.com" nodeid="1" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="switchedpdu2" port="6"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="plieadies2.atmexpress.com" nodeid="2" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="switchedpdu1" port="13"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="plieadies1.atmexpress.com" nodeid="3" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="switchedpdu2" port="12"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman/>
>         <fencedevices>
>                 <fencedevice agent="fence_apc" [snip]/>
>                 <fencedevice agent="fence_apc" [snip]/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="bias-plieadies2" nofailback="0" ordered="0" restricted="0">
>                                 <failoverdomainnode name="plieadies2.atmexpress.com" priority="1"/>
>                         </failoverdomain>
>                         <failoverdomain name="bias-plieadies1" nofailback="0" ordered="0" restricted="0">
>                                 <failoverdomainnode name="plieadies1.atmexpress.com" priority="1"/>
>                         </failoverdomain>
>                         <failoverdomain name="bias-plieadies3" nofailback="0" ordered="0" restricted="0">
>                                 <failoverdomainnode name="plieadies3.atmexpress.com" priority="1"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources/>
>                 <vm autostart="0" domain="bias-plieadies3" exclusive="0" max_restarts="0" migrate="live" name="SaturnX" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies2" exclusive="0" max_restarts="0" migrate="live" name="SaturnC" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies3" exclusive="0" max_restarts="0" migrate="live" name="SaturnE" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies3" exclusive="0" max_restarts="0" migrate="live" name="SaturnF" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies2" exclusive="0" max_restarts="0" migrate="live" name="SaturnD" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies1" exclusive="0" max_restarts="0" migrate="live" name="SaturnA" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies1" exclusive="0" max_restarts="0" migrate="live" name="Orion1" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies2" exclusive="0" max_restarts="0" migrate="live" name="Orion2" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies3" exclusive="0" max_restarts="0" migrate="live" name="Orion3" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies1" exclusive="0" max_restarts="0" migrate="live" name="SaturnB" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>                 <vm autostart="1" domain="bias-plieadies1" exclusive="0" max_restarts="0" migrate="live" name="Pluto" path="/etc/xen" recovery="restart" restart_expire_time="0"/>
>         </rm>
>         <fence_xvmd/>
> </cluster>
>



