[Linux-cluster] Clustat shows wrong service status

Lon Hohberger lhh at redhat.com
Thu Feb 28 18:25:50 UTC 2008


On Thu, 2008-02-28 at 12:32 +0100, Agnieszka Kukałowicz wrote:
> > 
> > And I don't have a situation where cman_tool nodes says:
> > " Last fenced   2008-02-27 15:24:16 by override"
> > 
> 
> I did more tests to find the cause of the problem.
> I found that clustat has a problem with a "restricted" failover domain.
> I tested 2 examples of my configuration: 
> 
> 1. failover domain is "restricted"
> 
> <rm>
>   <failoverdomains>
>     <failoverdomain name="VM_w1_failover" ordered="0" restricted="1">
>         <failoverdomainnode name="w1.local" priority="1"/>
>     </failoverdomain>
>     <failoverdomain name="VM_w2_failover" ordered="0" restricted="1">
>         <failoverdomainnode name="w2.local" priority="1"/>
>     </failoverdomain>
>   </failoverdomains>
>   <resources/>
>   <vm autostart="1" domain="VM_w1_failover" exclusive="0" name="VM_Work11_RHEL51" path="/virts/w11" recovery="restart"/>
>   <vm autostart="1" domain="VM_w1_failover" exclusive="0" name="VM_Work12_RHEL51" path="/virts/w12" recovery="restart"/>
>   <vm autostart="0" domain="VM_w1_failover" exclusive="0" name="VM_Work13_RHEL51" path="/virts/w13" recovery="disable"/>
>   <vm autostart="1" domain="VM_w2_failover" exclusive="0" name="VM_Work21_RHEL51" path="/virts/w21" recovery="restart"/>
>   <vm autostart="0" domain="VM_w2_failover" exclusive="0" name="VM_Work22_RHEL51" path="/virts/w22" recovery="disable"/>
>   <vm autostart="0" domain="VM_w2_failover" exclusive="0" name="VM_Work23_RHEL51" path="/virts/w23" recovery="disable"/>
> </rm>
> 
> Member Status: Quorate
> 
>   Member Name                        ID   Status
>   ------ ----                        ---- ------
>   w2.local                           1    Online, Local, rgmanager
>   w1.local                           2    Online, rgmanager
> 
>   Service Name         Owner (Last)                   State
>   ------- ----         ----- ------                   -----
>   vm:VM_Work11_RHEL51  w1.local                       started
>   vm:VM_Work12_RHEL51  w1.local                       started
>   vm:VM_Work13_RHEL51  (none)                         disabled
>   vm:VM_Work21_RHEL51  w2.local                       started
>   vm:VM_Work22_RHEL51  (none)                         disabled
>   vm:VM_Work23_RHEL51  (none)                         disabled
> 
> After powering off node w2.local and having w1.local fence it,
> clustat still shows the service vm:VM_Work21_RHEL51 as started on
> w2.local.
> 

Oh, so you had a restricted failover domain, and no member of that
domain was online.  Looking at the code, it appears to be "correct
weirdness" or perhaps "known dysfunction":

http://sources.redhat.com/git/?p=cluster.git;a=blob;f=rgmanager/src/daemons/groups.c;h=a8325eec3425bbe124696bfe3dcd6f7ea1eebfea;hb=8e504af1adbadd2cb8fe7cab191d79a8d835540c#l742


                        /*
                         * TODO
                         * Mark a service as 'stopped' if no members in its
                         * restricted fail-over domain are running.
                         */

Why it occurs:

The service states are typically only altered by a node which is taking
action on a service.  In this case, no online node is allowed to act on
the service, so nothing ever updates its state.  What needs to happen in
this case is:

  * Check service failover domain config
  * Check nodes online
  * Mark service 'stopped' if no nodes are online capable of running
    the service
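The three steps above could look roughly like the following C sketch. It is a minimal illustration under stated assumptions, not the actual fix in groups.c: `struct service`, `node_online`, `should_mark_stopped` and `mark_orphaned_services` are hypothetical names, and the caller is assumed to already have the list of online members and each service's failover domain from the configuration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified service model; NOT rgmanager's real types. */
enum svc_state { SVC_STARTED, SVC_STOPPED, SVC_DISABLED };

struct service {
    const char *name;
    const char *owner;          /* NULL means "(none)" */
    enum svc_state state;
    bool restricted;            /* restricted="1" on its failover domain */
    const char *const *domain;  /* failoverdomainnode names */
    size_t ndomain;
};

/* Step 2: is the named node among the online members? */
bool node_online(const char *node, const char *const *online, size_t nonline)
{
    for (size_t i = 0; i < nonline; i++)
        if (strcmp(node, online[i]) == 0)
            return true;
    return false;
}

/* Steps 1 and 3: a started service whose restricted domain has no
 * online member cannot be acted on by anyone, so report it stopped. */
bool should_mark_stopped(const struct service *svc,
                         const char *const *online, size_t nonline)
{
    if (!svc->restricted || svc->state != SVC_STARTED)
        return false;
    for (size_t i = 0; i < svc->ndomain; i++)
        if (node_online(svc->domain[i], online, nonline))
            return false;
    return true;
}

/* Apply the check to every service; returns how many were changed. */
size_t mark_orphaned_services(struct service *svcs, size_t nsvcs,
                              const char *const *online, size_t nonline)
{
    size_t changed = 0;
    for (size_t i = 0; i < nsvcs; i++) {
        if (should_mark_stopped(&svcs[i], online, nonline)) {
            svcs[i].state = SVC_STOPPED;
            svcs[i].owner = NULL;   /* owner becomes "(none)" */
            changed++;
        }
    }
    return changed;
}
```

With only w1.local online, this sketch would mark vm:VM_Work21_RHEL51 (started, owner w2.local, domain VM_w2_failover) as stopped and leave the disabled services alone. The check is limited to restricted domains because, in an unrestricted domain, some other online node is allowed to act on the service.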

This hasn't been fixed because it doesn't actually cause any
service-availability problems; it only causes incorrect status reporting.

Could you file a bugzilla? :)

-- Lon



