[Linux-cluster] Network hiccup + power-fencing = both nodes go down (redhat cluster 4)

Tue Jan 17 11:48:42 UTC 2006

Hi all, it has been a while since I posted anything.  Once again, I'd
appreciate any input on this latest issue.  We have a situation where
a "network hiccup" leaves both nodes suddenly unable to reach each
other, so each begins trying to fence the other (power fencing).  Then
the network returns and they power each other off.  My need: make
Red Hat Cluster robust enough not to do this.  It could be that my
configurations are wrong, so I have attached them.

My idea/solution: I THINK I could raise post_fail_delay above 0, so
that a node waits to see whether things "come back up" before it
fences.  Perhaps I make one node wait about 2 minutes for the other to
come back, and have the other node wait zero seconds, thus ensuring
that the two never act at the same time?
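
To make the reasoning concrete, here is a toy model of that idea, as a
minimal sketch only.  It does not model fenced itself: the function
name, the node names "a" and "b", and the fence_latency parameter (the
seconds between issuing a power-off and the peer actually losing power)
are all assumptions made for illustration, and it assumes the network
stays down for the whole race.

```python
def nodes_powered_off(delay_a, delay_b, fence_latency):
    """Toy model: return the set of nodes ("a", "b") that end up powered off.

    delay_a / delay_b: seconds each node waits before fencing its peer.
    fence_latency: assumed seconds between issuing a power-off and the
    peer actually going down.
    """
    # Order the nodes by how long each waits before fencing its peer.
    (first, d_first), (second, d_second) = sorted(
        [("a", delay_a), ("b", delay_b)], key=lambda item: item[1]
    )
    # The node with the shorter delay always gets to issue its power-off,
    # so the slower node always goes down.
    down = {second}
    # The slower node only fires back if its own timer expires before
    # it loses power.
    if d_second < d_first + fence_latency:
        down.add(first)
    return down


print(sorted(nodes_powered_off(0, 0, 5)))    # equal delays
print(sorted(nodes_powered_off(0, 120, 5)))  # staggered delays
```

In this model both nodes go down whenever the two delays differ by less
than the fence latency, and only the slower node goes down otherwise,
which is the asymmetry the staggered delays are meant to buy.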

Some small proof that the dual power-off happened:

I know that each box fenced the other and "succeeded", and my iLO
event logs show both servers being powered off.

Thanks a lot,

Jeff

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster_db2.conf
Type: application/octet-stream
Size: 1392 bytes
Desc: cluster_db2.conf
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20060117/482de11e/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster_db1.conf
Type: application/octet-stream
Size: 1392 bytes
Desc: cluster_db1.conf
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20060117/482de11e/attachment-0001.obj>