[Linux-cluster] Service Recovery Failure

Scott Becker scottb at bxwa.com
Mon Nov 26 22:36:15 UTC 2007


I just performed a test which failed miserably. I have two nodes (node 2 
and node 3) and simulated a NIC failure, expecting a fencing race with a 
good outcome. The good node never attempted to fence the bad node 
(although the bad one did make an attempt, as expected), and it also did 
not take over the service (really bad).

For the simulated NIC failure, I unplugged the cables from the public 
NIC on node 3, which was running the IP address service. The two nodes 
are also connected to a private network used solely for communicating 
with the power switches.

Here's all the data I know to fetch; what else can I provide? The very 
last line (from the disconnected node) is my fence script hack correctly 
aborting the fence, since its public NIC is down.
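
For context, the hack works along these lines: ping the default gateway 
first and bail out if it doesn't answer, which is what produces the 
"Can not ping gateway" line from fenced at the end of the log. A 
simplified sketch of the idea (not the exact script; here it's written 
as a wrapper around the stock agent, and the gateway address and agent 
path are placeholders):

#!/usr/bin/env python
# Simplified sketch of the fence wrapper -- not the exact script.
# It pings the default gateway and only hands off to the stock fence_apc
# agent if the gateway answers; otherwise it reports the problem and
# exits non-zero so fenced logs the failure. GATEWAY and REAL_AGENT are
# placeholders, not the real values.
import subprocess
import sys

GATEWAY = "205.234.65.1"        # placeholder for the real default gateway
REAL_AGENT = "/sbin/fence_apc"  # placeholder path to the stock agent

def gateway_reachable():
    # Single ping with a 2 second deadline; exit code 0 means a reply came back.
    devnull = open("/dev/null", "w")
    rc = subprocess.call(["ping", "-c", "1", "-w", "2", GATEWAY],
                         stdout=devnull, stderr=devnull)
    devnull.close()
    return rc == 0

def main():
    if not gateway_reachable():
        # fenced records whatever the agent prints, hence the
        # 'agent "fence_apc" reports: Can not ping gateway' message.
        sys.stderr.write("Can not ping gateway\n")
        return 1
    # Pass the key=value arguments fenced sends on stdin straight
    # through to the real agent and return its exit status.
    return subprocess.call([REAL_AGENT], stdin=sys.stdin)

if __name__ == "__main__":
    sys.exit(main())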

Further down the log (not shown here), fencing is attempted repeatedly. 
My understanding was that each method would only be tried once.

I was expecting both nodes to try to fence and the one with the good 
public NIC connection to win. Instead I ended up with a split brain 
(according to clustat), and the logs show the same sequence of events on 
both nodes, except that fencing was never attempted on one of them.

Help!




clustat from node 2 before failure test:


Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  205.234.65.132                        2 Online, Local, rgmanager
  205.234.65.133                        3 Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  service:Web Server A 205.234.65.133                 started







clustat from node 2 after pulling node 3:


Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  205.234.65.132                        2 Online, Local, rgmanager
  205.234.65.133                        3 Offline

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  service:Web Server A 205.234.65.133                 started










/etc/cluster/cluster.conf:


<?xml version="1.0"?>
<cluster alias="bxwa" config_version="8" name="bxwa">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="205.234.65.132" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="RackPDU1" option="off" port="2"/>
                                        <device name="RackPDU2" option="off" port="2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="205.234.65.133" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="RackPDU1" option="off" port="3"/>
                                        <device name="RackPDU2" option="off" port="3"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.7.11" login="root" name="RackPDU1" passwd_script="/root/cluster/rack_pdu"/>
                <fencedevice agent="fence_apc" ipaddr="192.168.7.12" login="root" name="RackPDU2" passwd_script="/root/cluster/rack_pdu"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
                <service autostart="1" exclusive="0" name="Web Server Address" recovery="relocate">
                        <ip address="205.234.65.138" monitor_link="1"/>
                </service>
        </rm>
</cluster>
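
One thing I noticed: with monitor_link="1" on the ip resource, clurgmgrd 
on node 3 sees the link go away ("Link for bond0: Not detected" in the 
log below) and stops the service locally as soon as the cables are 
pulled. To double-check the state the link monitor is reacting to, the 
kernel's carrier flag can be read directly (a rough check of my own, not 
rgmanager's actual code):

#!/usr/bin/env python
# Rough check of the link state that monitor_link="1" reacts to: with
# the cables pulled, the kernel clears the carrier flag on bond0 and the
# ip resource's status check starts failing.
import sys

def link_up(iface):
    try:
        f = open("/sys/class/net/%s/carrier" % iface)
        state = f.read().strip()
        f.close()
        return state == "1"
    except IOError:
        # Reading carrier fails while the interface is administratively down.
        return False

if __name__ == "__main__":
    if len(sys.argv) > 1:
        iface = sys.argv[1]
    else:
        iface = "bond0"
    if link_up(iface):
        print("%s link: detected" % iface)
    else:
        print("%s link: not detected" % iface)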









Node 2, /var/log/messages:
openais[9498]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[9498]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[9498]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[9498]: [TOTEM] entering GATHER state from 2.
openais[9498]: [TOTEM] entering GATHER state from 0.
openais[9498]: [TOTEM] Creating commit token because I am the rep.
openais[9498]: [TOTEM] Saving state aru 47 high seq received 47
openais[9498]: [TOTEM] Storing new sequence id for ring 6c
openais[9498]: [TOTEM] entering COMMIT state.
openais[9498]: [TOTEM] entering RECOVERY state.
openais[9498]: [TOTEM] position [0] member 205.234.65.132:
openais[9498]: [TOTEM] previous ring seq 104 rep 205.234.65.132
openais[9498]: [TOTEM] aru 47 high delivered 47 received flag 1
openais[9498]: [TOTEM] Did not need to originate any messages in recovery.
openais[9498]: [TOTEM] Sending initial ORF token
openais[9498]: [CLM  ] CLM CONFIGURATION CHANGE
openais[9498]: [CLM  ] New Configuration:
kernel: dlm: closing connection to node 3
fenced[9568]: 205.234.65.133 not a cluster member after 0 sec post_fail_delay
openais[9498]: [CLM  ]     r(0) ip(205.234.65.132)
openais[9498]: [CLM  ] Members Left:
openais[9498]: [CLM  ]     r(0) ip(205.234.65.133)
openais[9498]: [CLM  ] Members Joined:
openais[9498]: [CLM  ] CLM CONFIGURATION CHANGE
openais[9498]: [CLM  ] New Configuration:
openais[9498]: [CLM  ]     r(0) ip(205.234.65.132)
openais[9498]: [CLM  ] Members Left:
openais[9498]: [CLM  ] Members Joined:
openais[9498]: [SYNC ] This node is within the primary component and will provide service.
openais[9498]: [TOTEM] entering OPERATIONAL state.
openais[9498]: [CLM  ] got nodejoin message 205.234.65.132
openais[9498]: [CPG  ] got joinlist message from node 2





Node 3, /var/log/messages:
kernel: bonding: bond0: now running without any active interface !
openais[2921]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[2921]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[2921]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[2921]: [TOTEM] entering GATHER state from 2.
clurgmgrd: [3759]: <warning> Link for bond0: Not detected
clurgmgrd: [3759]: <warning> No link on bond0...
clurgmgrd[3759]: <notice> status on ip "205.234.65.138" returned 1 (generic error)
clurgmgrd[3759]: <notice> Stopping service service:Web Server Address
openais[2921]: [TOTEM] entering GATHER state from 0.
openais[2921]: [TOTEM] Creating commit token because I am the rep.
openais[2921]: [TOTEM] Saving state aru 47 high seq received 47
openais[2921]: [TOTEM] Storing new sequence id for ring 6c
openais[2921]: [TOTEM] entering COMMIT state.
openais[2921]: [TOTEM] entering RECOVERY state.
openais[2921]: [TOTEM] position [0] member 205.234.65.133:
openais[2921]: [TOTEM] previous ring seq 104 rep 205.234.65.132
openais[2921]: [TOTEM] aru 47 high delivered 47 received flag 1
openais[2921]: [TOTEM] Did not need to originate any messages in recovery.
openais[2921]: [TOTEM] Sending initial ORF token
openais[2921]: [CLM  ] CLM CONFIGURATION CHANGE
openais[2921]: [CLM  ] New Configuration:
kernel: dlm: closing connection to node 2
openais[2921]: [CLM  ]     r(0) ip(205.234.65.133)
openais[2921]: [CLM  ] Members Left:
fenced[2937]: 205.234.65.132 not a cluster member after 0 sec post_fail_delay
openais[2921]: [CLM  ]     r(0) ip(205.234.65.132)
openais[2921]: [CLM  ] Members Joined:
fenced[2937]: fencing node "205.234.65.132"
openais[2921]: [CLM  ] CLM CONFIGURATION CHANGE
openais[2921]: [CLM  ] New Configuration:
openais[2921]: [CLM  ]     r(0) ip(205.234.65.133)
openais[2921]: [CLM  ] Members Left:
openais[2921]: [CLM  ] Members Joined:
openais[2921]: [SYNC ] This node is within the primary component and will provide service.
openais[2921]: [TOTEM] entering OPERATIONAL state.
openais[2921]: [CLM  ] got nodejoin message 205.234.65.133
openais[2921]: [CPG  ] got joinlist message from node 3
fenced[2937]: agent "fence_apc" reports: Can not ping gateway



