[Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve?

Thu Sep 28 04:36:30 UTC 2006

Hello all,

I'm having a strange problem. Here is the scenario:
* 2-node GFS cluster on 2 Dell PE-2900 servers;
* 1 Dell|EMC CX300 storage, with servers direct attached using two HBAs 
each;
* RHEL AS 4 Update 4, no updates applied;
* Red Hat Cluster Suite v4 Update 4, no updates applied;
* Red Hat GFS Update 4, no updates applied;
* Using IPMI over LAN fencing.

Configuring the cluster was quite straightforward, and the GFS 
filesystems worked fine.

Since the Dell PowerEdge x9xx series now supports IPMI on both LOMs 
(onboard NICs) as a configurable failover option, we decided to "channel 
bond" eth0 and eth1 (onboard NICs) together to have both the normal 
network traffic and also the heartbeat traffic over a redundant channel 
(bond0). Since IPMI works over both NICs, fencing is expected to work 
even if one of the NICs/cables goes down.

Now the problem: whenever I pull both cables from one server, the 
servers almost simultaneously detect each other as offline (the logs 
show "serverX lost too many heartbeats, removing it from the Cluster"). 
A few seconds later, each server fences the other, at the same time!

As far as I can tell, there is some delay between the sending of the 
"power off" IPMI command and the actual power-off performed by the 
embedded IPMI controller. By the way, no "normal shutdown" is triggered 
by ACPI or APM; both are turned off on the servers.

So it seems that when the first server kills the other, there is enough 
time for the second server to send the IPMI command to kill the first 
server as well, and a few seconds later both are powered off, so my 
redundant environment goes down altogether.
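
To make the timing concrete, here is a rough Python sketch of the race 
as I understand it. It is only a model under stated assumptions, not 
anything taken from the Cluster Suite code: the names Node, 
surviving_nodes, and fence_delay are hypothetical, the times are 
illustrative rather than measured, and it assumes that both IPMI 
commands are delivered and that a node only stops running 
poweroff_latency seconds after the command aimed at it is sent.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    detect_at: float          # time (s) at which this node declares its peer dead
    fence_delay: float = 0.0  # hypothetical extra wait before sending power-off


def surviving_nodes(a: Node, b: Node, poweroff_latency: float) -> list:
    """Return the names of the nodes still powered on after the exchange.

    Assumption: a power-off command takes `poweroff_latency` seconds to
    take effect, and a node keeps running (and can still fence its peer)
    until that moment.
    """
    # Order the nodes by the time each one would send its power-off command.
    first, second = sorted((a, b), key=lambda n: n.detect_at + n.fence_delay)
    t_first = first.detect_at + first.fence_delay
    t_second = second.detect_at + second.fence_delay

    # The earlier sender always gets its command out, so the later
    # sender is powered off at this time.
    second_dies_at = t_first + poweroff_latency

    if t_second < second_dies_at:
        # The later sender is still alive when its own fence fires,
        # so it kills the earlier sender too: nobody survives.
        return []
    # The later sender was powered off before it could fence back.
    return [first.name]


# Both nodes detect the failure within half a second of each other and
# the power-off takes 4 s to act: both nodes end up fenced.
print(surviving_nodes(Node("node1", 10.0), Node("node2", 10.5), 4.0))

# Same scenario, but node2 waits longer than the power-off latency
# before fencing (a hypothetical setting): only node1 survives.
print(surviving_nodes(Node("node1", 10.0), Node("node2", 10.5, fence_delay=5.0), 4.0))
```

In this model both nodes go down whenever the gap between the two 
power-off commands is shorter than the power-off latency, which matches 
what I am seeing.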

Question: is anyone aware of a solution for this? Is there a way a 
server can notify the other that it is removing it from the cluster? 
Maybe using a shared disk? By the way, I haven't experimented with the 
new shared disk feature under CS v4, only with CS v3.

Thank you all in advance.

Regards,

Celso.
-- 
*Celso Kopp Webber*
