[Linux-cluster] proper cluster crash procedures?

Mon Sep 29 08:16:08 UTC 2008

Here is my cluster.conf

#########################################

<?xml version="1.0"?>
<cluster alias="myiacon" config_version="16" name="myiacon">
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
	<clusternodes>
		<clusternode name="ratchet.local" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="ratchet_ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="skydive.local" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="skydive_ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="wheeljack.local" nodeid="3" votes="1">
			<fence>
				<method name="1">
					<device name="wheeljack_ipmi"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman/>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.100" login="root" name="ratchet_ipmi" passwd="xxxxx"/>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.102" login="root" name="skydive_ipmi" passwd="xxxxx"/>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.101" login="root" name="wheeljack_ipmi" passwd="xxxxxx"/>
	</fencedevices>
	<rm>
		<failoverdomains/>
		<resources/>
	</rm>
</cluster>

#############################################
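
As a sanity check on the file above, here is a minimal sketch (plain Python, not part of the cluster tooling) that confirms every clusternode's fence device points at a defined fencedevice and adds up the votes. The function name check_cluster_conf and the trimmed SAMPLE_CONF are made up for illustration, and the quorum line assumes the usual simple-majority rule (total votes / 2 + 1), which I have not verified against this cman version.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above (fence_daemon, cman and rm omitted).
SAMPLE_CONF = """\
<cluster alias="myiacon" config_version="16" name="myiacon">
  <clusternodes>
    <clusternode name="ratchet.local" nodeid="1" votes="1">
      <fence><method name="1"><device name="ratchet_ipmi"/></method></fence>
    </clusternode>
    <clusternode name="skydive.local" nodeid="2" votes="1">
      <fence><method name="1"><device name="skydive_ipmi"/></method></fence>
    </clusternode>
    <clusternode name="wheeljack.local" nodeid="3" votes="1">
      <fence><method name="1"><device name="wheeljack_ipmi"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="192.168.1.100" login="root" name="ratchet_ipmi" passwd="xxxxx"/>
    <fencedevice agent="fence_ipmilan" ipaddr="192.168.1.102" login="root" name="skydive_ipmi" passwd="xxxxx"/>
    <fencedevice agent="fence_ipmilan" ipaddr="192.168.1.101" login="root" name="wheeljack_ipmi" passwd="xxxxxx"/>
  </fencedevices>
</cluster>
"""


def check_cluster_conf(xml_text):
    """Return node names, vote totals and any fencing problems found."""
    root = ET.fromstring(xml_text)

    # Names of every <fencedevice> declared under <fencedevices>.
    defined = {dev.get("name") for dev in root.iter("fencedevice")}

    nodes = []
    total_votes = 0
    problems = []
    for node in root.iter("clusternode"):
        name = node.get("name")
        nodes.append(name)
        # Treat a missing votes attribute as one vote (an assumption).
        total_votes += int(node.get("votes", "1"))

        # <device> entries inside this node's <fence> methods.
        refs = [dev.get("name") for dev in node.iter("device")]
        if not refs:
            problems.append(f"{name}: no fence device configured")
        for ref in refs:
            if ref not in defined:
                problems.append(f"{name}: fence device {ref!r} is not defined")

    return {
        "nodes": nodes,
        "total_votes": total_votes,
        # Assumed simple-majority rule, see the note above.
        "quorum": total_votes // 2 + 1,
        "problems": problems,
    }


if __name__ == "__main__":
    result = check_cluster_conf(SAMPLE_CONF)
    print("nodes:", ", ".join(result["nodes"]))
    print("total votes:", result["total_votes"], "quorum:", result["quorum"])
    for problem in result["problems"]:
        print("PROBLEM:", problem)
```

For the three one-vote nodes above this reports 3 total votes and, under the assumed majority rule, a quorum of 2, with no fencing problems. To check the real file, the same function can be fed the contents of /etc/cluster/cluster.conf.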

And here is one of the errors I just started getting:

Sep 29 08:10:06 wheeljack openais[5453]: [MAIN ] Killing node ratchet.local because it has rejoined the cluster with existing state

But half the time, the servers just complain that they can't reconnect to the
cluster.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Mark Chaney
Sent: Monday, September 29, 2008 3:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] proper cluster crash procedures?

I have a 3-node cluster with shared storage on an iSCSI SAN, hence I am
using GFS. Anyway, it crashed for whatever reason (not sure if something
was rebooted incorrectly or what), and I have now spent the past 2
hours trying to get the cluster back up. I would think that simply
rebooting all the nodes would work, but heck, that hasn't. What should I be
doing? Should I just start up one node at a time? BTW, I am using IPMI for
fencing, if that makes a difference. I can post my cluster.conf if that's
helpful, but I would think there would be general techniques available.

Thanks,
Mark

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster