[Linux-cluster] Active-Active configuration of arbitrary services

Lon Hohberger lhh at redhat.com
Fri Oct 19 19:16:01 UTC 2007


On Fri, 2007-10-19 at 14:53 +0000, Glenn Aycock wrote:
> We are running RHCS on RHEL 4.5 and have a basic 2-node HA cluster
> configuration for a critical application in place and functional. The
> config looks like this:

> <?xml version="1.0"?>
> <cluster config_version="16" name="routing_cluster">
>         <fence_daemon post_fail_delay="0" post_join_delay="10"/>
>         <clusternodes>
>                 <clusternode name="host1" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="manual" nodename="host1"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="host2" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="manual" nodename="host2"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman dead_node_timeout="10" expected_votes="1" two_node="1"/>
>         <fencedevices>
>                 <fencedevice agent="fence_manual" name="manual"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="routing_servers" ordered="1" restricted="1">
>                                 <failoverdomainnode name="host1" priority="1"/>
>                                 <failoverdomainnode name="host2" priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <script file="/etc/init.d/rsd" name="rsd"/>
>                         <ip address="123.456.78.9" monitor_link="1"/>
>                 </resources>
>                 <service autostart="1" domain="routing_servers" name="routing_daemon" recovery="relocate">
>                         <ip ref="123.456.78.9"/>
>                         <script ref="rsd"/>
>                 </service>
>         </rm>
> </cluster>

> The cluster takes about 15-20 seconds to notice that the daemon is
> down and migrate it to the other node. However, due to slow migration
> and startup time, we now require the daemon on the secondary to be
> active and only transfer the VIP in case it aborts on the primary. 

You could start by decreasing the 'status check' time, i.e. tweak the
"status"/"monitor" action intervals in /usr/share/cluster/script.sh:

        <action name="status" interval="30s" timeout="0"/>
        <action name="monitor" interval="30s" timeout="0"/>

Change to:
       
        <action name="status" interval="10s" timeout="0"/>
        <action name="monitor" interval="10s" timeout="0"/>

(as an example...)
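
For reference, those lines live in the <actions> block of the metadata
that script.sh emits; on the copies I have handy the surrounding stanza
looks roughly like this (yours may differ slightly between releases):

        <actions>
            <action name="start" timeout="0"/>
            <action name="stop" timeout="0"/>
            <action name="status" interval="30s" timeout="0"/>
            <action name="monitor" interval="30s" timeout="0"/>
            <action name="meta-data" timeout="0"/>
            <action name="verify-all" timeout="0"/>
        </actions>

Only the "status" and "monitor" intervals need to change.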

You can also make a wrapper script which doesn't do the stop phase of
your rsd script unless it's already in a non-working state (to prevent
the stop-before-start that rgmanager normally does):

#!/bin/bash

SCR=/etc/init.d/rsd

case $1 in
start)
    # Should be a no-op if already running
    $SCR start 
    exit $?
    ;;
stop)
    # Don't actually stop it if it's running; just
    # clean it up if it's broken.  This app is 
    # safe to run on multiple nodes
    $SCR status
    if [ $? -ne 0 ]; then
        $SCR stop
        exit $?
    fi
    exit 0
    ;;
status)
    $SCR status
    exit $?
    ;;
esac

exit 0

(Note: rsd will have to be enabled at boot on both nodes for this to
work.)
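
As a sketch (assuming the wrapper is saved as /usr/local/sbin/rsd-wrapper
and made executable -- adjust the path to taste), on both nodes:

        chkconfig rsd on
        service rsd start

and in cluster.conf (bumping config_version), point the script resource
at the wrapper instead of the init script itself:

        <script file="/usr/local/sbin/rsd-wrapper" name="rsd"/>

With that in place the wrapper's "stop" leaves a healthy rsd alone, so
only the IP actually moves on failover.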

-- Lon



