[Linux-cluster] fence_ipmilan issue

Mon Nov 5 19:36:16 UTC 2007

To get the node back into the cluster, I had to reboot all of the
nodes, which is not what I want to have happen.  I'm still not sure
why rgmanager was hung.

Instead of calling fence_ipmilan, I decided to see what would happen
if I pulled the Ethernet cable from a node running a service.  In
/var/log/messages on one node (isc0) I see the following:

Nov  5 12:46:17 isc0 openais[2870]: [TOTEM] Creating commit token
because I am the rep.
Nov  5 12:46:17 isc0 openais[2870]: [TOTEM] Saving state aru 97 high
seq received 97
Nov  5 12:46:17 isc0 openais[2870]: [TOTEM] entering COMMIT state.
Nov  5 12:46:17 isc0 openais[2870]: [TOTEM] entering GATHER state from 12.
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] entering GATHER state from 11.
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] Creating commit token
because I am the rep.
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] entering COMMIT state.
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] entering RECOVERY state.
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] position [0] member 172.16.127.122:
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] previous ring seq 52 rep
172.16.127.122
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] aru 97 high delivered 97
received flag 0
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] position [1] member 172.16.127.124:
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] previous ring seq 52 rep
172.16.127.122
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] aru 97 high delivered 97
received flag 0
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] Did not need to originate
any messages in recovery.
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] Storing new sequence id for ring 3c
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] Sending initial ORF token
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] New Configuration:
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ]     r(0) ip(172.16.127.122)
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ]     r(0) ip(172.16.127.124)
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] Members Left:
Nov  5 12:46:22 isc0 kernel: dlm: closing connection to node 2
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ]     r(0) ip(172.16.127.123)
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] Members Joined:
Nov  5 12:46:22 isc0 openais[2870]: [SYNC ] This node is within the
primary component and will provide service.
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] New Configuration:
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ]     r(0) ip(172.16.127.122)
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ]     r(0) ip(172.16.127.124)
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] Members Left:
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] Members Joined:
Nov  5 12:46:22 isc0 openais[2870]: [SYNC ] This node is within the
primary component and will provide service.
Nov  5 12:46:22 isc0 openais[2870]: [TOTEM] entering OPERATIONAL state.
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] got nodejoin message 172.16.127.122
Nov  5 12:46:22 isc0 openais[2870]: [CLM  ] got nodejoin message 172.16.127.124
Nov  5 12:46:22 isc0 openais[2870]: [CPG  ] got joinlist message from node 1
Nov  5 12:46:22 isc0 openais[2870]: [CPG  ] got joinlist message from node 3
Nov  5 12:46:42 isc0 fenced[2889]: isc1 not a cluster member after 20
sec post_fail_delay
Nov  5 12:46:42 isc0 fenced[2889]: fencing node "isc1"
Nov  5 12:46:42 isc0 fenced[2889]: agent "fence_ipmilan" reports:
Rebooting machine @ IPMI:172.16.158.160...Failed
Nov  5 12:46:42 isc0 fenced[2889]: fence "isc1" failed
Nov  5 12:46:47 isc0 fenced[2889]: fencing node "isc1"
Nov  5 12:46:48 isc0 fenced[2889]: agent "fence_ipmilan" reports:
Rebooting machine @ IPMI:172.16.158.160...Failed

The last three lines keep repeating.  Any clues as to what might be
wrong?  Here's my updated cluster.conf:
<?xml version="1.0"?>
<cluster alias="ices_nfscluster" config_version="100" name="nfs_cluster">
        <fence_daemon post_fail_delay="20" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="isc0" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="iisc0"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="isc1" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="iisc1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="isc2" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="iisc2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="none"
ipaddr="172.16.158.159" login="root" name="iisc0" passwd="changeme"/>
                <fencedevice agent="fence_ipmilan" auth="none"
ipaddr="172.16.158.160" login="root" name="iisc1" passwd="changeme"/>
                <fencedevice agent="fence_ipmilan" auth="none"
ipaddr="171.16.158.161" login="root" name="iisc2" passwd="changeme"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="fotest" ordered="1"
restricted="1">
                                <failoverdomainnode name="isc0" priority="1"/>
                                <failoverdomainnode name="isc1" priority="1"/>
                                <failoverdomainnode name="isc2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="172.16.127.15" monitor_link="1"/>
                        <ip address="172.16.127.17" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="fotest" name="nfstest"
recovery="restart">
                        <fs device="/dev/ices-fs/test" force_fsck="0"
force_unmount="1" fsid="13584" fstype="ext3" mountpoint="/export/test"
name="testfs" options="" self_fence="0"/>
                        <nfsexport name="test_export">
                                <nfsclient name="test_export"
options="async,rw,fsid=20" path="/export/test"
target="128.83.68.0/24"/>
                        </nfsexport>
                        <ip ref="172.16.127.15"/>
                </service>
                <service autostart="1" domain="fotest" name="nfsices"
recovery="relocate">
                        <fs device="/dev/ices-fs/ices" force_fsck="0"
force_unmount="1" fsid="44096" fstype="ext3"
mountpoint="/export/cices" name="nfsfs" options="" self_fence="0"/>
                        <nfsexport name="nfsexport">
                                <nfsclient name="nfsclient"
options="async,fsid=25,rw" path="/export/cices"
target="128.83.68.0/24"/>
                        </nfsexport>
                        <ip ref="172.16.127.17"/>
                </service>
        </rm>
</cluster>
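
In case it helps anyone reading along, here is a rough sketch of how
the fencing sections of a cluster.conf like the one above could be
cross-checked: for each clusternode it lists the fence device it
refers to, plus the agent and ipaddr from the matching fencedevice
entry.  This is not part of the cluster suite.  The function name
fence_map and the sample config (nodeA, nodeB, ipmiA, ipmiB, and the
192.0.2.10 address) are made up for illustration, and it assumes only
Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def fence_map(conf_xml):
    """Map each clusternode name to a list of (device, agent, ipaddr) tuples."""
    root = ET.fromstring(conf_xml)
    # Index the <fencedevice> definitions by their name attribute.
    devices = {d.get("name"): d for d in root.iter("fencedevice")}
    result = {}
    for node in root.iter("clusternode"):
        entries = []
        for dev in node.iter("device"):
            fd = devices.get(dev.get("name"))
            # agent/ipaddr stay None when no <fencedevice> has that name.
            agent = fd.get("agent") if fd is not None else None
            ipaddr = fd.get("ipaddr") if fd is not None else None
            entries.append((dev.get("name"), agent, ipaddr))
        result[node.get("name")] = entries
    return result


# Hypothetical two-node config; nodeB points at a device that is not defined.
SAMPLE = """<cluster name="demo" config_version="1">
  <clusternodes>
    <clusternode name="nodeA" nodeid="1">
      <fence><method name="1"><device name="ipmiA" lanplus="1"/></method></fence>
    </clusternode>
    <clusternode name="nodeB" nodeid="2">
      <fence><method name="1"><device name="ipmiB" lanplus="1"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="192.0.2.10" name="ipmiA"/>
  </fencedevices>
</cluster>"""

if __name__ == "__main__":
    for node, entries in sorted(fence_map(SAMPLE).items()):
        for device, agent, ipaddr in entries:
            print(node, device, agent, ipaddr)
```

Running the same function over the text of a real cluster.conf prints
one line per node, which makes it easy to eyeball whether every device
name resolves and whether the addresses are the ones you expect.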

Thanks,

Stew