[Linux-cluster] fence_ilo confused if both power supplies die

Miroslav Zubcic mvz+rhcluster at nimium.hr
Thu Dec 13 09:55:11 UTC 2007


Hi all,

Is this a bug? Should we report it on the official RHN (I hate that slow,
buggy, Oracle-based portal!)?

Summary:

We have a 2-node cluster on HP ProLiant DL380 G5 servers.

3 services in the cluster:
	- FreeRADIUS + IP addr
	- Apache + IP addr + storage LUN
	- Postgres + IP addr + storage LUN

Fencing is done via HP ILO cards.

A couple of days ago, both power supplies on one node died within a short
time (well, obviously it can happen). The fenced daemon, ccsd, and the
cluster in general didn't react well to that, despite surviving the
not-so-real-life acceptance tests where we pulled both power supplies
out. For the HP ILO card, a faulty power supply is something different
from a missing power supply. The ILO card continued to work on its
internal battery, but the "POWER ON" action did not succeed (the POWER
command kept returning that power is off).

This situation confused the fence_ilo agent. The agent saw that the other
server was down, but it never returned success to the cluster because it
FAILED TO POWER ON the other server.

I think this is buggy behaviour. Who cares if the fence agent cannot power
the fenced node back on? Why didn't it just give up? Here is the relevant
part of the log on the healthy node, which tried to fence the other node.


Dec 10 03:37:14 aoc01 kernel: CMAN: removing node aoc02 from the cluster : Missed too many heartbeats
Dec 10 03:37:14 aoc01 fenced[3012]: aoc02 not a cluster member after 0 sec post_fail_delay
Dec 10 03:37:14 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:37:50 aoc01 fenced[3012]: agent "fence_ilo" reports: failed to turn on
Dec 10 03:37:50 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 03:37:55 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:37:55 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 03:37:55 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
Dec 10 03:37:55 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 03:38:00 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:38:00 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 03:38:00 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor

Dec 10 05:42:13 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:18 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:18 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 05:42:18 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
Dec 10 05:42:18 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:23 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:23 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 05:42:23 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
Dec 10 05:42:23 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:28 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:28 aoc01 ccsd[2896]: process_get: Invalid connection descriptor received.
Dec 10 05:42:28 aoc01 ccsd[2896]: Error while processing get: Invalid request descriptor
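
The behaviour argued for above can be sketched roughly as follows. This
is hypothetical Python, not the real fence_ilo code: the function name
fence_reboot, the three callables, and the DeadPsuNode stand-in are all
assumptions made up for illustration. The idea is only that a "reboot"
fence action should report success once the node is confirmed off, and
treat a failed power-on as a warning.

```python
from typing import Callable


def fence_reboot(power_off: Callable[[], None],
                 power_on: Callable[[], None],
                 is_on: Callable[[], bool]) -> int:
    """Hypothetical reboot action: return 0 if the node is fenced, 1 if not."""
    power_off()
    if is_on():
        # The node may still be running, so fencing really did fail.
        return 1
    # From here on the node is known to be off, i.e. it is fenced.
    try:
        power_on()
        if not is_on():
            print("warning: node is off but could not be powered on")
    except Exception as exc:
        print(f"warning: power on failed: {exc}")
    return 0


class DeadPsuNode:
    """Stand-in for a node whose power supplies have died (assumption)."""

    def __init__(self) -> None:
        self.on = False

    def power_off(self) -> None:
        self.on = False

    def power_on(self) -> None:
        # The command is accepted, but the node stays off.
        pass

    def is_on(self) -> bool:
        return self.on


if __name__ == "__main__":
    node = DeadPsuNode()
    print(fence_reboot(node.power_off, node.power_on, node.is_on))
```

With a node like DeadPsuNode, this sketch returns 0, so the cluster
could go on and recover the services instead of retrying the fence
operation over and over.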



-- 
Miroslav




