[Linux-cluster] Cluster failover troubleshooting
Christopher Strider Cook
ccook at pandora.com
Thu Jan 21 22:44:29 UTC 2010
Please help me figure out why this cluster failed over. This has
happened several times in the past month, while previously it had been
quite stable. What can trigger corosync's "[TOTEM ] A processor failed,
forming new configuration." message? By all appearances the primary
server was functioning properly until it was fenced by the secondary.
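For context, my (possibly wrong) understanding is that this message fires when a node fails to circulate the totem token within the token timeout, so anything that stalls the network or the corosync process (a switch hiccup, NIC flap, heavy load or scheduling stalls) can trigger it even though the node looks healthy. If it turns out to be marginal timing rather than a real outage, I gather the timeout can be raised in cluster.conf when running under cman; a sketch, not my current config (the value is a guess):

```xml
<!-- Hypothetical tuning sketch: raise the totem token timeout
     (milliseconds) so brief network stalls don't force a new
     configuration. Goes inside the top-level <cluster> element. -->
<totem token="30000"/>
```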
I've got cluster3 running on Debian lenny, kernel 2.6.30-1-amd64:

ii openais 1.0.0-3local1 Standards-based cluster framework (daemon an
ii corosync 1.0.0-4 Standards-based cluster framework (daemon an
ii rgmanager 3.0.0-1~agx0lo clustered resource group manager
ii cman 3.0.0-1~agx0lo cluster manager
A bunch of successful status checks on the active server, nicks, leading
up to:
Jan 21 04:28:08 wonder corosync[2856]: [TOTEM ] A processor failed,
forming new configuration.
Jan 21 04:28:09 wonder qdiskd[2873]: Writing eviction notice for node 2
Jan 21 04:28:10 wonder qdiskd[2873]: Node 2 evicted
Jan 21 04:28:11 nicks corosync[2991]: [CMAN ] lost contact with quorum
device
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.21)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] This node is within the
primary component and will provide service.
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] Members[1]:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] 1
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Jan 21 04:28:12 wonder corosync[2856]: [MAIN ] Completed service
synchronization, ready to provide service.
Jan 21 04:28:12 wonder rgmanager[3578]: State change: nicks-p DOWN
Jan 21 04:28:13 wonder kernel: [1298595.738213] dlm: closing connection
to node 2
Jan 21 04:28:13 wonder fenced[3206]: fencing node nicks-p
Jan 21 04:28:12 nicks corosync[2991]: [QUORUM] This node is within the
primary component and will provide service.
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] Members[2]:
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 1
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 2
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:17 wonder rgmanager[3578]: Taking over service
service:MailHost from down member nicks-p
Jan 21 04:28:17 wonder bash[13236]: Unknown file system type 'ext4' for
device /dev/dm-0. Assuming fsck is required.
Jan 21 04:28:17 wonder bash[13259]: Running fsck on /dev/dm-0
Jan 21 04:28:18 wonder bash[13284]: mounting /dev/dm-0 on /home
Jan 21 04:28:18 wonder bash[13306]: mount -t ext4 -o
defaults,noatime,nodiratime /dev/dm-0 /home
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13368]: mounting /dev/dm-1 on /var/cluster
Jan 21 04:28:19 wonder bash[13390]: mount -t ext3 -o defaults /dev/dm-1
/var/cluster
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:20 wonder bash[13467]: Link for eth0: Detected
Jan 21 04:28:20 wonder bash[13489]: Adding IPv4 address 172.25.16.58/22
to eth0
Jan 21 04:28:20 wonder bash[13513]: Sending gratuitous ARP: 172.25.16.58
00:30:48:c6:df:ce brd ff:ff:ff:ff:ff:ff
Jan 21 04:28:21 wonder bash[13551]: Executing
/etc/cluster/MailHost-misc-early start
Jan 21 04:28:21 wonder bash[13606]: Executing
/etc/cluster/saslauthd-cluster start
Jan 21 04:28:21 wonder bash[13679]: Executing
/etc/cluster/postfix-cluster start
Jan 21 04:28:22 wonder bash[13788]: Executing
/etc/cluster/dovecot-wrapper start
Jan 21 04:28:22 wonder bash[13850]: Executing
/etc/cluster/mailman-wrapper start
Jan 21 04:28:23 wonder bash[13901]: Executing
/etc/cluster/apache2-mailhost start
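For anyone trying to line the two hosts up: this is roughly how I interleaved the relevant events from both nodes' syslogs (the sample lines inlined below are copied from the logs above; on the real hosts I read /var/log/syslog, the Debian default):

```shell
# Merge membership / quorum / fencing events from both nodes into one
# time-ordered view. Sample lines are inlined here for illustration.
cat > /tmp/both-nodes.log <<'EOF'
Jan 21 04:28:08 wonder corosync[2856]: [TOTEM ] A processor failed, forming new configuration.
Jan 21 04:28:09 wonder qdiskd[2873]: Writing eviction notice for node 2
Jan 21 04:28:11 nicks corosync[2991]: [CMAN ] lost contact with quorum device
Jan 21 04:28:13 wonder fenced[3206]: fencing node nicks-p
EOF
# Keep only the subsystems involved in the failover, sorted by timestamp
# (field 3 of the syslog line).
grep -E 'TOTEM|qdiskd|CMAN|fenced' /tmp/both-nodes.log | sort -k3
```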
<?xml version="1.0"?>
<cluster name="alpha" config_version="44">
<cman two_node="0" expected_votes="3">
</cman>
<clusternodes>
<clusternode name="wonder-p" votes="1" nodeid="1">
<fence>
<method name="single">
<device name="pwr01" option="off"/>
<device name="pwr02" option="off"/>
<device name="pwr01" option="on"/>
<device name="pwr02" option="on"/>
</method>
</fence>
</clusternode>
<clusternode name="nicks-p" votes="1" nodeid="2">
<fence>
<method name="single">
<device name="pwr03" option="off"/>
<device name="pwr04" option="off"/>
<device name="pwr03" option="on"/>
<device name="pwr04" option="on"/>
</method>
</fence>
</clusternode>
</clusternodes>
<quorumd interval="1" tko="10" votes="1" label="quorumdisk">
<heuristic program="ping 172.25.19.254 -c1 -t1" score="1" interval="2"
tko="3"/>
</quorumd>
<fence_daemon post_join_delay="20">
</fence_daemon>
<fencedevices>
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="4"
name="pwr01" udpport="161" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="4"
name="pwr02" udpport="161" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="3"
name="pwr03" udpport="161" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="3"
name="pwr04" udpport="161" />
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="mailcluster" restricted="1" ordered="0" >
<failoverdomainnode name="wonder-p" priority="1"/>
<failoverdomainnode name="nicks-p" priority="1"/>
</failoverdomain>
</failoverdomains>
<service name="MailHost" autostart="1" domain="mailcluster" >
<script name="MailHost-early" file="/etc/cluster/MailHost-misc-early" />
<fs name="mailhome" mountpoint="/home" device="/dev/dm-0" fstype="ext4"
force_unmount="1" active_monitor="1"
options="defaults,noatime,nodiratime" />
<fs name="mailcluster" mountpoint="/var/cluster" device="/dev/dm-1"
fstype="ext3" force_unmount="1" active_monitor="1" options="defaults" />
<ip address="172.25.16.58" monitor_link="1" />
<script name="saslauthd" file="/etc/cluster/saslauthd-cluster" />
<script name="postfix" file="/etc/cluster/postfix-cluster" />
<script name="dovecot" file="/etc/cluster/dovecot-wrapper"
__independent_subtree="1" />
<script name="mailman" file="/etc/cluster/mailman-wrapper"
__independent_subtree="1" />
<script name="apache2-mailhost" file="/etc/cluster/apache2-mailhost"
__independent_subtree="1" />
<script name="usermin" file="/etc/init.d/usermin-sb"
__independent_subtree="1" />
<script name="MailHost-late" file="/etc/cluster/MailHost-misc-late" />
</service>
</rm>
</cluster>
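One thing I'm now second-guessing in the config above: the qdisk(5) man page recommends (if I'm reading it right) that cman's membership timeout be at least twice qdiskd's timeout, which qdiskd derives as interval × tko. A back-of-envelope check with my numbers:

```shell
# Back-of-envelope timing check using the quorumd values from my
# cluster.conf (interval="1", tko="10"). The 2x rule is my reading of
# qdisk(5), so treat it as an assumption, not gospel.
interval=1
tko=10
qdisk_timeout_ms=$(( interval * tko * 1000 ))
min_token_ms=$(( 2 * qdisk_timeout_ms ))
echo "qdisk timeout: ${qdisk_timeout_ms} ms; totem token should be >= ${min_token_ms} ms"
```

If that rule holds, my qdisk gives up on a node after ~10 s, which may be out of step with the membership timeout corosync is actually using.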
Thanks
Chris