[Linux-cluster] Cluster reboot problems

Fri Jan 23 14:32:25 UTC 2009

Hi,

I've got a 3-node RHEL 5.3 cluster. I'm running the cluster nodes as Xen Dom0 
domains so I can deploy DomU domains as vm services within the cluster.
Hardware is:

3 x Dell PowerEdge 1855 blades
2 x Dell PowerConnect 5316M Ethernet modules (for eth0 and eth1)

I have a 4th blade acting as an iSCSI target, exporting one 2GB and two 20GB 
targets. The 2GB target holds /etc/xen/ for the cluster nodes: it is a _netdev 
mount in /etc/fstab on each node, mounted on /xen, with symlinks from /etc/xen 
to /xen/xen.
All network traffic uses the same switch module, since I'm only using eth0 at 
this time.

To install the nodes, I'm kickstarting from a Satellite, and doing a "yum 
update" followed by a reboot to get to RHEL 5.3.
I also deploy the same cluster.conf to each node (appended to this email).
I then bring up cman, rgmanager, clvmd and gfs on all nodes (using the "Send 
input to all sessions" feature of Konsole to start the services at the same 
time on all nodes). This brings up the cluster, and allows me to mount the 
iSCSI target for /xen.
Starting xend allows me to enable the vm service listed in cluster.conf 
(clusvcadm -e vm:node1).
Oh, I also log *.* to a syslog server so I can see all the logs in one place.

Nodes are:
	c1.eris.qinetiq.com
	c2.eris.qinetiq.com
	c3.eris.qinetiq.com

"So far so good", I think.

So, I enable cman, rgmanager, clvmd, gfs and xend to start on boot and reboot 
the cluster (all three nodes at the same time).

At which point everything starts to fall apart.

As the nodes come up and try to create a cluster, nodes c1 and c2 appear to 
form a cluster, and then fence node c3 when it joins.

When node c3 comes back up and tries to join the cluster, node c1 decides the 
cluster is no longer quorate, and fences node c2.
When node c2 comes back up and tries to join the cluster, node c1 decides the 
cluster is no longer quorate, and fences node c3.

This then continues for as long as I'm entertained watching the logs, at which 
point I switch off all three servers.

Does anyone have any insight as to what the difference is between starting the 
cluster services manually and starting them at boot, and why that 
difference (because I can't think of any other difference between the two 
states) would cause me to never gain a stable cluster?

I'm at a bit of a loss really - I moved from a 2-node cluster to a 3-node one 
to try and avoid exactly these problems.
I've also had the same problem with a CentOS 5.2 cluster on the same 
hardware - in that case the nodes were still fencing each other the following 
morning, 18 hours later!
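
For reference, the vote arithmetic in the cluster.conf appended to this email 
can be sanity-checked with a few lines of Python. This is only a sketch: the 
vote_summary and majority helpers are made-up names, the embedded config is a 
trimmed copy of the attachment (vote-related elements only), and the 
simple-majority rule (votes // 2 + 1) is my assumption about how the quorum 
threshold is derived, not something taken from the cman source.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the appended cluster.conf: only the vote-related elements.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster alias="WebFarmTest" config_version="1" name="WebFarmTest">
        <clusternodes>
                <clusternode name="c1.eris.qinetiq.com" nodeid="1" votes="1"/>
                <clusternode name="c2.eris.qinetiq.com" nodeid="2" votes="1"/>
                <clusternode name="c3.eris.qinetiq.com" nodeid="3" votes="1"/>
        </clusternodes>
        <cman expected_votes="2"/>
</cluster>
"""


def majority(votes):
    # ASSUMPTION: quorum is a simple majority of the given vote count.
    return votes // 2 + 1


def vote_summary(conf_text):
    """Hypothetical helper: summarise the votes declared in a cluster.conf."""
    root = ET.fromstring(conf_text)
    # Sum the votes attribute of every clusternode (treated as 1 if absent).
    total = sum(int(node.get("votes", "1")) for node in root.iter("clusternode"))
    # Read the expected_votes value configured on the cman element.
    expected = int(root.find("cman").get("expected_votes"))
    return {
        "total_votes": total,
        "expected_votes": expected,
        "quorum_from_expected": majority(expected),
        "quorum_from_total": majority(total),
    }


if __name__ == "__main__":
    print(vote_summary(CLUSTER_CONF))
```

Under that assumption, both the configured expected_votes of 2 and the actual 
total of 3 votes give a quorum threshold of 2, so any two of the three nodes 
should be enough for the cluster to stay quorate.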

Regards,

Mark.

-- 
Mark Watts BSc RHCE MBCS
Senior Systems Engineer
QinetiQ Applied Technologies
GPG Key: http://www.linux-corner.info/mwatts.gpg
-------------- next part --------------
<?xml version="1.0"?>
<cluster alias="WebFarmTest" config_version="1" name="WebFarmTest">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="c1.eris.qinetiq.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="DRACMC" modulename="Server-1" action="Off"/>
                                        <device name="DRACMC" modulename="Server-1" action="On"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="c2.eris.qinetiq.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="DRACMC" modulename="Server-2" action="Off"/>
                                        <device name="DRACMC" modulename="Server-2" action="On"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="c3.eris.qinetiq.com" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="DRACMC" modulename="Server-3" action="Off"/>
                                        <device name="DRACMC" modulename="Server-3" action="On"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="2"/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="XXX" login="XXX" name="DRACMC" passwd="XXX"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="webfarm-fd" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="c1.eris.qinetiq.com" priority="1"/>
                                <failoverdomainnode name="c2.eris.qinetiq.com" priority="1"/>
                                <failoverdomainnode name="c3.eris.qinetiq.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                <vm autostart="1" domain="webfarm-fd" exclusive="1" migrate="live" name="node1" path="/etc/xen/" recovery="relocate"/>
        </rm>
</cluster>