[Linux-cluster] fence_ilo confused if both power supplies die

Coman ILIUT comaniliut at yahoo.com
Wed Dec 19 15:30:33 UTC 2007


We ran into the same problem and ended up writing a new fence_ilo method that sends a power reset via ILO. With a power reset, nothing happens if the node is powered down; if it is powered up, it is powered down and then up again. HP has a PDF document describing the ILO interface; take a look at it.
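
To make the reset semantics above concrete, here is a minimal Python sketch. It only models the behaviour described (no-op when the node is off, off-then-on when it is on). The `PowerControl` interface and the `FakeNode` class are hypothetical names invented for illustration and are not part of fence_ilo or of the HP ILO interface, where a reset would normally be a single command sent to the card.

```python
from typing import Protocol


class PowerControl(Protocol):
    """Hypothetical power interface; not the real ILO API."""

    def is_on(self) -> bool: ...
    def power_off(self) -> None: ...
    def power_on(self) -> None: ...


class FakeNode:
    """In-memory stand-in for a node's power state (illustration only)."""

    def __init__(self, on: bool) -> None:
        self.on = on
        self.actions = []  # records the power commands that were issued

    def is_on(self) -> bool:
        return self.on

    def power_off(self) -> None:
        self.on = False
        self.actions.append("off")

    def power_on(self) -> None:
        self.on = True
        self.actions.append("on")


def power_reset(node: PowerControl) -> None:
    """Model of the reset described above: no-op if off, off then on if on."""
    if not node.is_on():
        return  # node is already powered down, so nothing happens
    node.power_off()
    node.power_on()
```

With this model, a node that has lost power is simply left alone, so the reset cannot fail on the power-on step.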

Coman

----- Original Message ----
From: Miroslav Zubcic <mvz+rhcluster at nimium.hr>
To: linux-cluster at redhat.com
Sent: Thursday, December 13, 2007 4:55:11 AM
Subject: [Linux-cluster] fence_ilo confused if both power supplies die

Hi all,

Is this a bug? Should we report it on the official RHN? (I hate that slow,
buggy, Oracle-based portal!)

Summary:

We have a 2-node cluster on HP ProLiant DL380 G5 servers.

3 services in the cluster:
    - FreeRADIUS + IP addr
    - Apache + IP addr + storage LUN
    - Postgres + IP addr + storage LUN

Fencing is done via HP ILO cards.

A couple of days ago, both power supplies on one node died within a short
time (well, obviously it can happen). The fenced daemon, ccsd, and the
cluster generally didn't react well to that, despite surviving
non-real-life acceptance tests in which we pulled both power supplies out.
To the HP ILO card, a faulty power supply is something different from a
missing power supply. The ILO card continued to work on its internal
battery, but the "POWER ON" action did not succeed (the POWER command kept
returning that power is off).

This situation confused the fence_ilo agent. The agent saw that the other
server was down, but it never returned success to the cluster because it
FAILED TO POWER ON the other server.

I think this is buggy behaviour. Who cares if the fence agent cannot power
the fenced node on again? Why didn't it just give up? Here is the relevant
part of the log on the healthy node, which tried to fence the other node.


Dec 10 03:37:14 aoc01 kernel: CMAN: removing node aoc02 from the cluster : Missed too many heartbeats
Dec 10 03:37:14 aoc01 fenced[3012]: aoc02 not a cluster member after 0 sec post_fail_delay
Dec 10 03:37:14 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:37:50 aoc01 fenced[3012]: agent "fence_ilo" reports: failed to turn on
Dec 10 03:37:50 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 03:37:55 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:37:55 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 03:37:55 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
Dec 10 03:37:55 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 03:38:00 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:38:00 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 03:38:00 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor

Dec 10 05:42:13 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:18 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:18 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 05:42:18 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
Dec 10 05:42:18 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:23 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:23 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 05:42:23 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
Dec 10 05:42:23 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:28 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:28 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 05:42:28 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
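
The behaviour argued for above (report success once the node is confirmed off, and treat the power-on step as best effort) could look roughly like the Python sketch below. This is an illustration under stated assumptions, not how fence_ilo is actually implemented: `DeadPsuNode` and `fence` are hypothetical names, and the node is modelled in memory rather than through a real ILO card.

```python
class DeadPsuNode:
    """Stand-in for a node whose power supplies have failed (illustration only)."""

    def __init__(self, on: bool = False) -> None:
        self.on = on

    def is_on(self) -> bool:
        return self.on

    def power_off(self) -> None:
        self.on = False

    def power_on(self) -> None:
        # Faulty power supplies: the command is accepted but the node stays off.
        pass


def fence(node) -> bool:
    """Return True once the node is confirmed off; power-on is best effort."""
    node.power_off()
    if node.is_on():
        return False  # power-off could not be confirmed: fencing really failed
    node.power_on()
    if not node.is_on():
        # Worth logging, but the node is down, which is all fencing requires.
        print("warning: node is off but could not be powered on again")
    return True
```

In this sketch a node with dead power supplies yields a successful fence plus a warning, so the cluster could go on to recover the services instead of retrying every few seconds.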



-- 
Miroslav


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





