[Linux-cluster] Cluster reboot problems

Johann Peyrard peyrardj at bull.net
Wed Jan 28 13:56:45 UTC 2009


I found this in cluster-2.03.11/doc/usage.txt:

- To avoid unnecessary fencing when starting the cluster, it's best for
  all nodes to join the cluster (complete cman_tool join) before any
  of them do fence_tool join.



I think something should be fixed to resolve this issue.
It is a real problem on a "production" system.

Once the fence domain has been closed (after a fence_tool join), a node can no longer enter the cluster.
You have to run the following at the same time on all nodes:
	#cman_tool join
	#fence_tool join
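
A rough shell sketch of that workaround, driven from one admin host; the passwordless ssh access and the hostnames node1/node2/node3 are only assumptions, adapt them to your own setup:

	# join cman on every node first
	for n in node1 node2 node3; do ssh $n cman_tool join; done
	# wait until all three nodes appear as members ("M") in cman_tool nodes
	while [ "$(ssh node1 cman_tool nodes | grep -c ' M ')" -lt 3 ]; do sleep 2; done
	# only then join the fence domain on every node
	for n in node1 node2 node3; do ssh $n fence_tool join; done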

Strange behaviour... I have this problem on RHEL 5.3.


On Fri, Jan 23, 2009 at 02:32:25PM +0000, Mark Watts wrote:
> 
> Hi,
> 
> I've got a 3-node RHEL 5.3 cluster. I'm running the cluster nodes as Xen Dom0 
> domains so I can deploy DomU domains as vm services within the cluster.
> Hardware is:
> 
> 3 x Dell PowerEdge 1855 blades
> 2 x Dell PowerConnect 5316M Ethernet modules (for eth0 and eth1)
> 
> I have a 4th blade acting as an iSCSI target, exporting a 2GB and two 20GB 
> targets. The 2GB target is used as /etc/xen/ on the cluster nodes, mounted as 
> a _netdev mount in /etc/fstab on the cluster nodes (mounted on /xen, with 
> symlinks from /etc/xen to /xen/xen).
> All network traffic uses the same switch module, since I'm only using eth0 at 
> this time.
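
For reference, the _netdev mount mentioned above would look roughly like the fstab line below; the device path and filesystem type are only placeholders, not taken from the original mail:

	/dev/sdb1	/xen	ext3	_netdev,defaults	0 0

The _netdev option makes the netfs init script mount it late, once the network (and the iSCSI initiator) is up, instead of during the early local mount pass.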
> 
> To install the nodes, I'm kickstarting from a Satellite, and doing a "yum 
> update" followed by a reboot to get to RHEL 5.3.
> I also deploy the same cluster.conf to each node (appended to this email).
> I then bring up cman, rgmanager, clvmd and gfs on all nodes (using the "Send 
> input to all sessions" feature of Konsole to start the services at the same 
> time on all nodes). This brings up the cluster, and allows me to mount the 
> iSCSI target for /xen.
> Starting xend allows me to enable the vm service listed in cluster.conf 
> (clusvcadm -e vm:node1)
> Oh, I also log *.* to a syslog server so I can see all the logs in one place.
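
Spelled out as init scripts, that manual start (minus the /xen mount) is roughly the following, run on all three nodes at once; the service names assume the stock RHEL 5 packages, and the order is just the usual dependency order, cluster infrastructure before the resource manager and Xen:

	#service cman start
	#service clvmd start
	#service gfs start
	#service rgmanager start
	#service xend start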
> 
> Nodes are:
> 	c1.eris.qinetiq.com
> 	c2.eris.qinetiq.com
> 	c3.eris.qinetiq.com
> 
> "So far so good", I think.
> 
> So, I enable cman, rgmanager, clvmd, gfs and xend to start on boot and reboot 
> the cluster (all three nodes at the same time)
> 
> At which point everything starts to fall apart.
> 
> As the nodes come up and try and create a cluster, nodes c1 and c2 appear to 
> form a cluster, and then fence node c3 when it joins.
> 
> When node c3 comes back up and tries to join the cluster, node c1 decides the 
> cluster is no longer quorate, and fences node c2.
> When node c2 comes back up and tries to join the cluster, node c1 decides the 
> cluster is no longer quorate, and fences node c3.
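
When the nodes get into that state, it is worth capturing what each one thinks is going on before it gets fenced, e.g. on every node:

	#cman_tool status
	#cman_tool nodes
	#group_tool ls

cman_tool status and cman_tool nodes show quorum and membership from that node's point of view, and group_tool ls shows the state of the fence domain.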
> 
> This then continues for as long as I'm entertained watching the logs, and 
> switch off all three servers.
> 
> 
> Does anyone have any insight into what the difference is between starting the 
> cluster services manually and starting them at boot, and why that 
> difference (because I can't think of any other difference between the two 
> states) would cause me to never gain a stable cluster?
> 
> I'm at a bit of a loss really - I moved from a 2-node cluster to a 3-node one 
> to try and avoid exactly these problems.
> I've also had the same problem with a CentOS 5.2 cluster on the same 
> hardware - in that case the nodes were still fencing each other the following 
> morning, 18 hours later!
> 
> 
> Regards,
> 
> Mark.
> 
> -- 
> Mark Watts BSc RHCE MBCS
> Senior Systems Engineer
> QinetiQ Applied Technologies
> GPG Key: http://www.linux-corner.info/mwatts.gpg

> <?xml version="1.0"?>
> <cluster alias="WebFarmTest" config_version="1" name="WebFarmTest">
>         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="c1.eris.qinetiq.com" nodeid="1" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="DRACMC" modulename="Server-1" action="Off"/>
>                                         <device name="DRACMC" modulename="Server-1" action="On"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="c2.eris.qinetiq.com" nodeid="2" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="DRACMC" modulename="Server-2" action="Off"/>
>                                         <device name="DRACMC" modulename="Server-2" action="On"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="c3.eris.qinetiq.com" nodeid="3" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="DRACMC" modulename="Server-3" action="Off"/>
>                                         <device name="DRACMC" modulename="Server-3" action="On"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="2"/>
>         <fencedevices>
>                 <fencedevice agent="fence_drac" ipaddr="XXX" login="XXX" name="DRACMC" passwd="XXX"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="webfarm-fd" nofailback="0" ordered="0" restricted="1">
>                                 <failoverdomainnode name="c1.eris.qinetiq.com" priority="1"/>
>                                 <failoverdomainnode name="c2.eris.qinetiq.com" priority="1"/>
>                                 <failoverdomainnode name="c3.eris.qinetiq.com" priority="1"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources/>
>                 <vm autostart="1" domain="webfarm-fd" exclusive="1" migrate="live" name="node1" path="/etc/xen/" recovery="relocate"/>
>         </rm>
> </cluster>
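
For what it's worth, the fence_daemon line at the top of that config is the knob most relevant to this startup race: post_join_delay is the number of seconds fenced waits after a node joins the domain before fencing any nodes it expected but has not seen, and clean_start="0" means fenced does not simply assume the missing nodes are in a clean state. Something like the line below (the value 60 is only an illustration, not a tested recommendation) gives slow-booting nodes more time to join before anyone gets fenced:

	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>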







