[Linux-cluster] RH Cluster doesn't pass basic acceptance tests - bug in fenced?

Josef Whiter jwhiter at redhat.com
Wed Jan 10 20:16:49 UTC 2007


On Wed, Jan 10, 2007 at 08:47:09PM +0100, Miroslav Zubcic wrote:
> Hi people.
> 
> I had this problem last spring while configuring an RH cluster for a
> local telco. RH technical support was not very helpful; they told me
> this is not a bug and so on. So I would like to ask here on the RH
> cluster list, in the hope of better advice.
> 
> When I have a 2-node cluster with RSA II management cards (fence_rsa agent),
> configured with one Oracle database in failover together with a VIP address
> and 5 LUNs shared from EMC storage, how can I pass one simple test of
> pulling out the main data Ethernet cables from the active node?
> 
> Let's say I have interface bond0 (data subnet/VLAN) and bond1 (fence
> subnet/VLAN) on each node. Our customers (and we as well, logically) expect
> that if we pull both data cables from bond0, the inactive node will
> kill/fence the active node and take over its services.
> 
> Unfortunately, what we see almost every time during the acceptance test is
> that the two nodes kill each other, no matter whether they have a link
> or not.
> 
> Here is a fragment from /var/adm/messages on the active node when I disable
> bond0 (by pulling out the cables):
> 
> ---------------------------------------------------------------------
> Jan  9 14:05:43 north clurgmgrd: [4593]: <warning> Link for bond0: Not
> detected
> Jan  9 14:05:43 north clurgmgrd: [4593]: <warning> No link on bond0...
> 
> Jan  9 14:05:43 north clurgmgrd[4593]: <notice> status on ip
> "10.156.10.32/26" returned 1 (generic error)
> Jan  9 14:05:43 north clurgmgrd[4593]: <notice> Stopping service ora_PROD
> Jan  9 14:05:53 north kernel: CMAN: removing node south from the cluster :
> Missed too many heartbeats
> Jan  9 14:05:53 north fenced[4063]: north not a cluster member after 0 sec
> post_fail_delay
> Jan  9 14:05:53 north fenced[4063]: fencing node "south"
> Jan  9 14:05:55 north shutdown: shutting down for system halt
> Jan  9 14:05:55 north init: Switching to runlevel: 0
> Jan  9 14:05:55 north login(pam_unix)[4599]: session closed for user root
> Jan  9 14:05:56 north rgmanager: [4270]: <notice> Shutting down Cluster
> Service Manager...
> Jan  9 14:05:56 north clurgmgrd[4593]: <notice> Shutting down
> Jan  9 14:05:56 north fenced[4063]: fence "south" success
> 
> 	 [...]
> 
> Jan  9 14:11:19 north syslogd 1.4.1: restart.
> ----------------------------------------------------------
> 
> As we can see here, clurgmgrd(8) on node "north" has DETECTED that there is
> no link, it began to stop the service "ora_PROD", and the system goes into
> shutdown. So far, so good. But then the fenced(8) daemon decides to fence
> the "south" node (the healthy node, which has a data link and everything it
> needs to take over the ora_PROD service: Oracle + IP + 5 ext3 FSs from EMC
> storage)! Why?
> 
> Of course, south is also fencing north, and I then have the tragicomic
> situation where both nodes are being rebooted by each other.
> 
> How can I prevent this? It looks like a bug. I don't want fenced to fence
> the other node, south, if it already "knows" that it is the one without a link.
> 
> What can we do? We cannot pass acceptance tests with the cluster in such a state. :-(
> 
> Thanks for any advice ...
> 

This isn't a bug; it's working as expected.  What you need is qdisk: set it up
with the proper heuristics and it will force the shutdown of the bad node before
the bad node has a chance to fence off the working node.

Josef
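
For illustration, here is a minimal sketch of the kind of quorumd section Josef
is referring to, as it would go into /etc/cluster/cluster.conf. The qdisk label,
the timing values, and the data-subnet gateway address 10.156.10.1 used by the
ping heuristic are assumed values for this example, not settings from the
original cluster:

----------------------------------------------------------------------
<quorumd interval="1" tko="10" votes="1" label="prodqdisk">
        <!-- Heuristic: ping the default gateway on the bond0 data
             subnet (assumed to be 10.156.10.1). A node that has lost
             its bond0 link fails this check, loses the qdisk vote,
             and is evicted (rebooted) before it can fence the
             healthy peer. -->
        <heuristic program="ping -c1 -w1 10.156.10.1" score="1" interval="2"/>
</quorumd>
----------------------------------------------------------------------

With a quorum disk in a two-node cluster you would normally also drop
two_node="1" and set expected_votes="3" in the cman configuration, so that the
qdisk vote decides which node stays quorate; see qdisk(5) for the exact
attributes supported by your release.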



