[Linux-cluster] manual fencing not working in RHEL4 branch

Tue Dec 6 02:04:38 UTC 2005

David,

I have tried the same init scripts with both ipmi and drac fencing and
had no problems.  When I try manual fencing, it seems that fence_manual
introduces some strangeness that triggers my problem.

What is the problem:
When running manual fencing and doing failover testing, my secondary
node takes over the service without waiting for a fence_ack_manual. 
This all works perfectly with automatic fencing (ipmi, drac).

I have the same problem (most of the time) when I run this whole thing by hand:
1. nodeA: ccsd
2. nodeB: ccsd
3. nodeA: cman_tool join -w
4. nodeB: cman_tool join -w
5. nodeA: fence_tool join -w
6. nodeB: fence_tool join -w
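
One way to take timing out of the picture when scripting the steps above
is to hold off steps 5 and 6 until both nodes show up as members in
'cman_tool nodes'.  The following is only a minimal sketch, not a tested
init script.  The function names, the CMAN_NODES_CMD override, and the
assumed column layout of 'cman_tool nodes' (status "M" for a member, node
name in the last column) are illustrative assumptions and should be
checked against the real output on your systems.

```shell
#!/bin/bash
# Sketch: delay 'fence_tool join' until every expected node is a cluster member.
# ASSUMPTION: 'cman_tool nodes' prints one row per node, with a status column
# ("M" = member) immediately before the node name in the last column.

# all_members reads 'cman_tool nodes' style output on stdin and succeeds only
# if every node name passed as an argument appears with status "M".
all_members() {
    local listing node
    listing=$(cat)
    for node in "$@"; do
        echo "$listing" | awk -v n="$node" '
            NF >= 2 && $NF == n && $(NF-1) == "M" { found = 1 }
            END { exit !found }' || return 1
    done
    return 0
}

# wait_for_members TIMEOUT NODE... polls once per second until all nodes are
# members, and fails if TIMEOUT seconds pass first.
# CMAN_NODES_CMD is a hypothetical override used here only to allow testing.
wait_for_members() {
    local timeout=$1; shift
    local waited=0
    until ${CMAN_NODES_CMD:-cman_tool nodes} | all_members "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}
```

On each node, after ccsd and 'cman_tool join -w', this could be used as
'wait_for_members 60 nodeA nodeB && fence_tool join -w'.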

When I start to see the problem, on the next reboot of both systems
I can replace steps 5 & 6 with 'fenced -D'.  Now if I try to fail over
a machine, manual fencing works perfectly (meaning it forces me to run
fence_ack_manual before a service fails over).  Next, I can go in and
change 'fenced -D' back to 'fence_tool join -w' and things still work
(it forces me to run fence_ack_manual).
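
During failover testing, a quick way to tell whether fenced is actually
waiting for an acknowledgement is to look for the FIFO mentioned in the
quoted tests below (/tmp/fence_manual.fifo).  This is a minimal sketch
under that assumption.  The function names and the FENCE_FIFO variable
are made up for illustration, and the path may differ on other versions.

```shell
#!/bin/bash
# Sketch: report whether manual fencing appears to be waiting for an ack.
# ASSUMPTION: a pending manual fence is indicated by the FIFO named in this
# thread; FENCE_FIFO is a hypothetical override for other locations.
FENCE_FIFO=${FENCE_FIFO:-/tmp/fence_manual.fifo}

# fence_pending succeeds only if the path exists and is a named pipe (FIFO).
fence_pending() {
    [ -p "$FENCE_FIFO" ]
}

# report_fence_state prints a one-line summary of the current state.
report_fence_state() {
    if fence_pending; then
        echo "pending: $FENCE_FIFO exists, fence_ack_manual is expected"
    else
        echo "not pending: $FENCE_FIFO not found"
    fi
}
```

Calling report_fence_state on the surviving node right after the other
node goes down should show whether the failover is being held back.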

Next, if I replace the manual steps above with the init scripts then
manual fencing breaks all over again until I repeat the above steps.

Sounds like a timing issue around fence_manual?  Let me know if you
want me to try anything different.  Thanks for all your help.

On 11/30/05, David Teigland <teigland at redhat.com> wrote:
> On Tue, Nov 29, 2005 at 07:53:09PM -0700, busy admin wrote:
> > Here's a quick summary of what I've done and the results... to
> > simplify the config I've just been running ccsd and cman via init
> > scripts during boot and then manually executing 'fenced' or 'fence_tool'
> > or the fenced init script. The results I see are random successes and
> > failures!
> >
> > Initial test - rebooted both systems and then, on both, executed 'fenced
> > -D'. Both systems joined the cluster and it was quorate. Rebooted one
> > node and to my surprise manual fencing worked, meaning
> > /tmp/fence_manual.fifo was created and I had to run 'fence_ack_manual'
> > on the other node. Tried again when the first node came back up and
> > again everything worked as expected.
> >
> > Additional testing - rebooted both systems and then, on both, executed
> > 'fence_tool join -w'. Both systems joined the cluster and it was
> > quorate. Rebooted one node and no fencing was done (nothing logged in
> > /var/log/messages).
> >
> > rebooted both systems again and this time executed 'fenced -D' on both
> > nodes... rebooted a node and fencing worked, was logged in
> > /var/log/messages and I had to manually run 'fence_ack_manual -n x64-5'.
> > when that node came back up again I again manually executed 'fenced
> > -D' on it and the cluster was quorate. I then rebooted the other node
> > and again fencing worked!
> >
> > so again I rebooted both nodes and executed 'fence_tool join -w' on
> > each... I again rebooted a node and fencing worked this time. fenced
> > msgs were logged to /var/log/messages, /tmp/fence_manual.fifo was
> > created and I had to execute 'fence_ack_manual -n x64-4' to recover.
> >
> > ... more testing w/mixed results ...
> >
> > modified fenced init script to execute 'fenced -D &' instead of
> > 'fence_tool join -w' and used chkconfig to turn it on on both systems
> > and rebooted them. both systems restarted and joined the cluster. once
> > again I rebooted one node (x64-4) and fencing didn't work... nothing
> > was logged in /var/log/messages from fenced. see corresponding
> > /var/log/messages, fenced -D output and cluster.conf below.
>
> It's not clear what you're trying to test or what you expect to happen.
> Here's the optimal way to start up a cluster from a newly rebooted state:
>
> 1. nodeA: ccsd
> 2. nodeB: ccsd
> 3. nodeA: cman_tool join -w
> 4. nodeB: cman_tool join -w
> 5. nodeA: fence_tool join
> 6. nodeB: fence_tool join
>
> It's best if steps 5 & 6 only happen after both nodes are members of
> the cluster (see 'cman_tool nodes').  If this is the case, then no
> nodes should be fenced when starting up.
>
> If you use the init scripts you may lose a little control and certainty
> about what happens when, so I'd suggest using the commands directly until
> you know that things are running correctly, then try the init scripts.
>
> If, from the state above, nodeB fails, then nodeA should always fence
> nodeB.  With manual fencing, this means that a message should appear in
> nodeA's /var/log/messages telling you to reboot nodeB and run
> fence_ack_manual.  If, by chance, nodeB reboots and rejoins the cluster
> before you get to running fence_ack_manual, the fencing system on nodeA
> will just complete the fencing operation itself and you don't need to run
> fence_ack_manual (and if you try, the fence_ack_manual command will report
> an error.)
>
> Dave
>
>