[Linux-cluster] manual fencing not working in RHEL4 branch

busy admin busyadmin at gmail.com
Wed Nov 30 02:53:09 UTC 2005


On 11/28/05, David Teigland <teigland at redhat.com> wrote:
> On Mon, Nov 28, 2005 at 02:07:52PM -0700, busy admin wrote:
> > I'm doing some testing with manual fencing and here's what I've found:
> >
> > Using the RHEL4 branch of code and running on a RHEL4 U1 system manual
> > fencing doesn't seem to work. If I have a simple two node cluster and
> > force a reboot of the primary node (node1 - running the service) the
> > service fails over to the secondary node (node2) and starts running
> > without me having to execute 'fence_ack_manual -n node1'. In fact, if I
> > look at the /tmp filesystem I never see the fifo file being created, so
> > if I try to execute 'fence_ack_manual' it complains about the fifo file
> > not existing. It's as if fenced calling fence_manual isn't creating the
> > fifo file to begin with.
> >
> > Using the STABLE branch and building against a 2.6.12 kernel, manual
> > fencing works as expected. When I force a reboot of the system running
> > the service, the service doesn't fail over until I manually execute
> > 'fence_ack_manual'; then the service starts successfully on the
> > remaining node.
> >
> > Any comments? Anyone else observe this same behavior? Is this just
> > broken in the RHEL4 branch?
>
> I can't recall or see any changes since RHEL4U1 that would explain this.
> Could you run fenced -D and send the output?


Here's a quick summary of what I've done and the results... to
simplify the config I've just been running ccsd and cman via init
scripts during boot and then manually executing 'fenced' or 'fence_tool'
or the fenced init script. The results I see are random successes and
failures!
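
For reference, the rough per-node sequence is (init script and tool
names as shipped with the RHEL4 cluster suite):

  service ccsd start       # started at boot via init script
  service cman start       # started at boot via init script
  # then fencing is started by hand, one of:
  fenced -D                # fence daemon in the foreground with debug output
  fence_tool join -w       # or: join the fence domain and wait for completion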

Initial test - rebooted both systems and then, on both, executed 'fenced
-D'; both systems joined the cluster and it was quorate. Rebooted one
node and, to my surprise, manual fencing worked, meaning
/tmp/fence_manual.fifo was created and I had to run 'fence_ack_manual'
on the other node. Tried again when the first node came back up and
again everything worked as expected.
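
(For what it's worth, the successful case follows the expected
manual-fence flow on the surviving node -- the node name here is just
from my test setup:)

  ls -l /tmp/fence_manual.fifo   # fifo created by fence_manual while fenced waits
  # ...confirm the failed node really is down or rebooting, then:
  fence_ack_manual -n x64-4      # acknowledge; fenced then completes recovery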

Additional testing - rebooted both systems and then, on both, executed
'fence_tool join -w'; both systems joined the cluster and it was
quorate. Rebooted one node and no fencing was done (nothing logged in
/var/log/messages).
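
(Before each reboot test I sanity-check that both nodes actually joined
the fence domain -- assuming the usual RHEL4 cman tools and /proc
interface:)

  cman_tool status             # cluster state, including quorum
  cman_tool nodes              # both nodes listed as members
  cat /proc/cluster/services   # the "default" fence domain should list both nodes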

Rebooted both systems again and this time executed 'fenced -D' on both
nodes... rebooted a node and fencing worked: it was logged in
/var/log/messages and I had to manually run 'fence_ack_manual -n x64-5'.
When that node came back up I again manually executed 'fenced -D' on it
and the cluster was quorate. I then rebooted the other node and again
fencing worked!

So again I rebooted both nodes and executed 'fence_tool join -w' on
each... I again rebooted a node and fencing worked this time: messages
from fenced were logged to /var/log/messages, /tmp/fence_manual.fifo was
created, and I had to execute 'fence_ack_manual -n x64-4' to recover.

... more testing w/mixed results ...

Modified the fenced init script to execute 'fenced -D &' instead of
'fence_tool join -w', used chkconfig to turn it on on both systems, and
rebooted them. Both systems restarted and joined the cluster. Once
again I rebooted one node (x64-4) and fencing didn't work... nothing
was logged in /var/log/messages from fenced. See the corresponding
/var/log/messages, fenced -D output, and cluster.conf below.
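
(For clarity, the init script change was just a one-line swap -- the
exact contents of /etc/init.d/fenced may differ between package
versions:)

  # in /etc/init.d/fenced, replaced:
  #   fence_tool join -w
  # with:
  fenced -D &

  chkconfig fenced on    # enabled at boot on both systems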

---8<--- /var/log/messages
Nov 29 16:45:56 x64-5 kernel: CMAN: removing node x64-4 from the
cluster : Shutdown
Nov 29 16:46:34 x64-5 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov 29 16:49:22 x64-5 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov 29 16:49:24 x64-5 kernel: e1000: eth1: e1000_watchdog: NIC Link is
Up 1000 Mbps Full Duplex
Nov 29 16:50:08 x64-5 kernel: CMAN: node x64-4 rejoining
--->8---

---8<--- fenced -D output
fenced: 1133307956 start:
fenced: 1133307956   event_id    = 2
fenced: 1133307956   last_stop   = 1
fenced: 1133307956   last_start  = 2
fenced: 1133307956   last_finish = 1
fenced: 1133307956   node_count  = 1
fenced: 1133307956   start_type  = leave
fenced: 1133307956 members:
fenced: 1133307956   nodeid =  2 "x64-5"
fenced: 1133307956 do_recovery stop 1 start 2 finish 1
fenced: 1133307956 add node 1 to list 2
fenced: 1133307956 finish:
fenced: 1133307956   event_id    = 2
fenced: 1133307956   last_stop   = 1
fenced: 1133307956   last_start  = 2
fenced: 1133307956   last_finish = 2
fenced: 1133307956   node_count  = 0
fenced: 1133308214 stop:
fenced: 1133308214   event_id    = 0
fenced: 1133308214   last_stop   = 2
fenced: 1133308214   last_start  = 2
fenced: 1133308214   last_finish = 2
fenced: 1133308214   node_count  = 0
fenced: 1133308214 start:
fenced: 1133308214   event_id    = 3
fenced: 1133308214   last_stop   = 2
fenced: 1133308214   last_start  = 3
fenced: 1133308214   last_finish = 2
fenced: 1133308214   node_count  = 2
fenced: 1133308214   start_type  = join
fenced: 1133308214 members:
fenced: 1133308214   nodeid =  2 "x64-5"
fenced: 1133308214   nodeid =  1 "x64-4"
fenced: 1133308214 do_recovery stop 2 start 3 finish 2
fenced: 1133308214 finish:
fenced: 1133308214   event_id    = 3
fenced: 1133308214   last_stop   = 2
fenced: 1133308214   last_start  = 3
fenced: 1133308214   last_finish = 3
fenced: 1133308214   node_count  = 0
--->8---

---8<--- cluster.conf
<?xml version="1.0"?>
<cluster config_version="22" name="testcluster">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="x64-5">
                        <fence>
                                <method name="single">
                                        <device name="manual" ipaddr="x64-5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="x64-4">
                        <fence>
                                <method name="single">
                                        <device name="manual" ipaddr="x64-4"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_manual" name="manual"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="10.0.0.120" monitor_link="1"/>
                </resources>
                <service name="ipservice">
                        <ip ref="10.0.0.120"/>
                </service>
        </rm>
</cluster>
--->8---



