[Linux-cluster] RH Cluster doesn't pass basic acceptance tests - bug in fenced?

Miroslav Zubcic mvz+rhcluster at nimium.hr
Wed Jan 10 19:47:09 UTC 2007


Hi people.

I had this problem in spring last year while configuring an RH cluster
for a local telco. RH technical support was not very helpful; they told me
this is not a bug and so on ... So I would like to ask here on the RH
cluster list, in the hope of better advice.

When I have a 2-node cluster with RSA II management cards (fence_rsa agent)
configured to run one Oracle database in failover, together with a VIP address
and 5 LUNs shared from EMC storage, how can I pass a simple test of pulling
out the main data Ethernet cables from the active node?

Let's say that I have interfaces bond0 (data subnet/VLAN) and bond1 (fence
subnet/VLAN) on each node. Our customers (and we too, logically) expect that
if we pull both data cables from bond0, the inactive node will kill/fence the
active node and take over its services.

Unfortunately, what we see almost every time during the acceptance test is
that the two nodes kill each other, regardless of whether they have a link
or not.

Here is a fragment from /var/adm/messages on the active node when I disable
bond0 (by pulling out the cables):

---------------------------------------------------------------------
Jan  9 14:05:43 north clurgmgrd: [4593]: <warning> Link for bond0: Not
detected
Jan  9 14:05:43 north clurgmgrd: [4593]: <warning> No link on bond0...

Jan  9 14:05:43 north clurgmgrd[4593]: <notice> status on ip
"10.156.10.32/26" returned 1 (generic error)
Jan  9 14:05:43 north clurgmgrd[4593]: <notice> Stopping service ora_PROD
Jan  9 14:05:53 north kernel: CMAN: removing node south from the cluster :
Missed too many heartbeats
Jan  9 14:05:53 north fenced[4063]: north not a cluster member after 0 sec
post_fail_delay
Jan  9 14:05:53 north fenced[4063]: fencing node "south"
Jan  9 14:05:55 north shutdown: shutting down for system halt
Jan  9 14:05:55 north init: Switching to runlevel: 0
Jan  9 14:05:55 north login(pam_unix)[4599]: session closed for user root
Jan  9 14:05:56 north rgmanager: [4270]: <notice> Shutting down Cluster
Service Manager...
Jan  9 14:05:56 north clurgmgrd[4593]: <notice> Shutting down
Jan  9 14:05:56 north fenced[4063]: fence "south" success

	 [...]

Jan  9 14:11:19 north syslogd 1.4.1: restart.
----------------------------------------------------------

As we can see here, clurgmgrd(8) on node "north" has DETECTED that there is
no link, it began to stop the service "ora_PROD", and the system went into
shutdown. So far, so good. But then the fenced(8) daemon decides to fence
node "south" (the healthy node, which has a data link and everything it needs
to take over the ora_PROD service: Oracle + IP + 5 ext3 filesystems from EMC
storage)! Why?
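
For completeness: fenced runs with the default post_fail_delay of 0 seconds,
as the log itself shows; the fence_daemon line in cluster.conf is just the
stock one, something like:

---------------------------------------------------------------------
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
---------------------------------------------------------------------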

Of course, south is also fencing north, and I then have the tragicomic
situation where both nodes are being rebooted by each other.

How can I prevent this? This looks like a bug. I don't want fenced to fence
the other node, south, if it already "knows" that it is the one without a link.

What to do? We cannot pass acceptance tests with the cluster in such a state. :-(

Thanks for any advice ...


-- 
Miroslav Zubcic, Nimium d.o.o., email: <mvz at nimium.hr>
Tel: +385 01 4852 639, Fax: +385 01 4852 640, Mobile: +385 098 942 8672
Mrazoviceva 12, 10000 Zagreb, Hrvatska



