[Linux-cluster] RE: Cluster Suite 4 failover problem

Lon Hohberger lhh at redhat.com
Thu Oct 26 14:36:38 UTC 2006


On Sat, 2006-10-21 at 21:59 +0800, Dicky wrote:
> HI  Jeff & Lon,
> 
> Thanks for the reply.
> 
> Regarding the didn't-failover issue (the display just showed "Owner --> 
> unknown" and "State --> started", but actually no services were 
> available), I checked the log and agree that it should be the 
> fence_manual problem. The log messages showed that fence_manual 
> was waiting for node2 to rejoin the cluster; as soon as I executed 
> the command "fence_ack_manual -n node2", the failed services failed 
> over to node1, and all of them were back to normal.

Sweet.

> I would like to know if there is any solution or workaround for this 
> situation other than buying a fence device. :) Can I remove the 
> fence RPM? Will it cause any extra problems?

AAhhhhh...

Well, *just* file system and data corruption in the case that a node
*didn't* fully die. ;)

> It is because, in a production environment, we never know when a 
> machine will go down, and we cannot execute the fence_ack_manual 
> command immediately.

Ok, here's the scoop.  Fencing is there to protect your data.  If you
don't care about your data, or you are not sharing data between the
nodes, then you do not need fencing.

It works by preventing a node which is *believed* to be dead from
writing to your shared data.  The key word is *believed*.  Sometimes,
due to load spikes, live hangs, network partitions, or other events
outside of administrative control, a node is believed to be dead but is
not, in actuality, dead.  If it "wakes up" and starts writing happily
along where it left off, your file system(s) and data probably will not
last long.
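
For reference, manual fencing is just a stanza in cluster.conf -- roughly
like this (the device name "human" is only a placeholder for the example):

    <clusternode name="node2" votes="1">
      <fence>
        <method name="1">
          <device name="human" nodename="node2"/>
        </method>
      </fence>
    </clusternode>

    <fencedevices>
      <fencedevice name="human" agent="fence_manual"/>
    </fencedevices>

With something like that in place, fenced sits at the fence_manual step
until a human runs "fence_ack_manual -n node2" -- which is exactly the
wait you saw in your logs.  There is no way around the human step
without a real fence device.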


> Regarding the monitor_link issue, I have tried to set monitor_link="1" 
> for both IP resources, i.e. 192.168.0.111 and 192.168.0.112. Then I 
> shut down eth0 of node2 and re-enabled it. When I tried to restart 
> rgmanager on node2 (i.e. the failed node), it kept showing the message 
> "Shutting down Cluster Service Manager... Waiting for services to 
> stop:". I had to kill rgmanager's processes, or even worse, reset the 
> machine. Any ideas?

If you want to test link monitoring, yank the cable out.  That's what
it's designed to detect. :)
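
For what it's worth, link monitoring is just an attribute on the ip
resource in cluster.conf, something along these lines (the service name
here is made up):

    <service name="svc1" autostart="1">
      <ip address="192.168.0.111" monitor_link="1"/>
    </service>

If memory serves, the agent decides link up/down from the carrier
state, which is the same thing you can check by hand with:

    ethtool eth0 | grep "Link detected"

Downing the interface in software is not the same event as losing
carrier, which is why pulling the cable is the honest test.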

I suspect this is a case of rgmanager trying to take locks.
Unfortunately, I think CMAN and the DLM would still be using the IP you
just pulled off the system.  Rgmanager is probably blocking while
trying to take a lock, and so it hangs.

Are there any log messages?
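
rgmanager logs through syslog as clurgmgrd, so assuming a stock syslog
configuration, something like this on node2 would show where it got
stuck:

    grep clurgmgrd /var/log/messages | tail -20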


> One more thing: even with monitor_link=0 in cluster.conf, the 
> system-config-cluster --> Resource --> IP address "Monitor Link" box 
> is ticked! Why?

That sounds like a bug in system-config-cluster.  Please file it in
Bugzilla.

-- Lon



