[Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve?
Celso K. Webber
celso at webbertek.com.br
Thu Sep 28 04:36:30 UTC 2006
I'm having a strange problem. Here is the scenario:
* 2-node GFS cluster on 2 Dell PE-2900 servers;
* 1 Dell|EMC CX300 storage array, with the servers directly attached using two HBAs;
* RHEL AS 4 Update 4, no updates applied;
* Red Hat Cluster Suite v4 Update 4, no updates applied;
* Red Hat GFS Update 4, no updates applied;
* Using IPMI over LAN fencing.
The cluster was configured in a quite straightforward way; the GFS filesystems
Since the Dell PowerEdge x9xx series now supports IPMI on both LOMs
(onboard NICs) as a configurable failover option, we decided to "channel
bond" eth0 and eth1 (onboard NICs) together to have both the normal
network traffic and also the heartbeat traffic over a redundant channel
(bond0). Since IPMI works over both NICs, fencing is expected to work
even if one of the NICs/cables goes down.
Now the problem: whenever I pull both cables from one server, the
servers almost simultaneously detect each other as offline (the logs
show "serverX lost too many heartbeats, removing it from the Cluster").
A few seconds later, each server fences the other, at the same time!
As far as I can tell, there is some delay between the sending of the
"power off" IPMI command and the actual power-off by the embedded IPMI
controller. By the way, there is no "normal shutdown" triggered by ACPI or
APM; both are turned off on the servers.
So it seems that when the first server kills the other, there is enough
time for the second server to send the IPMI command to kill the first
server as well, and a few seconds later both are powered off, so my
redundant environment goes down altogether.
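To make the timing concrete, here is a minimal Python sketch of the race as I understand it. It is only a model, not anything taken from Cluster Suite: the function name fence_race, the node names, and all the timing values are made up for illustration, and it assumes the power-off latency is the same on both servers.

```python
def fence_race(send_times, poweroff_latency):
    """Return the set of nodes still powered on after a mutual fencing attempt.

    send_times: dict of two entries, node name -> time (in seconds) at which
        that node would send the IPMI "power off" command to its peer.
    poweroff_latency: seconds between a command being sent and the target
        actually losing power (assumed identical for both nodes).
    """
    # Order the two nodes by the time they would send their command.
    (first, t1), (second, t2) = sorted(send_times.items(), key=lambda kv: kv[1])

    # The earlier sender is always still alive when it sends, so its peer
    # loses power at this moment.
    second_dies_at = t1 + poweroff_latency

    # If the peer gets its own command out before it loses power,
    # both nodes end up powered off.
    if t2 < second_dies_at:
        return set()

    # Otherwise the peer was already off and only the earlier sender survives.
    return {first}


# Hypothetical numbers: both nodes react within half a second of each other,
# while the power-off takes 4 seconds to take effect -> nobody survives.
print(fence_race({"node1": 10.0, "node2": 10.5}, poweroff_latency=4.0))

# If node2 reacted later than the power-off latency, node1 alone would survive.
print(fence_race({"node1": 10.0, "node2": 15.0}, poweroff_latency=4.0))
```

In this model both servers go down whenever the gap between the two "power off" commands is shorter than the power-off latency, which matches what I am seeing.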
Question: is anyone aware of a solution for this? Is there a way a
server can notify the other that it is removing it from the cluster?
Maybe using a shared disk? By the way, I have not experimented with the
new shared disk feature under CS v4, only with CS v3.
Thank you all in advance.
*Celso Kopp Webber*