[Linux-cluster] Failover root cause

Muhammad Panji sumodirjo at gmail.com
Fri Nov 9 00:47:55 UTC 2012


Dear All,
I have an Oracle cluster on RHEL 6.2 with two servers. Several days ago
the service failed over from node1 to node2. In /var/log/messages
on node2 I only see these messages:

...
Oct 23 12:54:19 db2svr corosync[4142]:   [TOTEM ] A processor failed,
forming new configuration.
Oct 23 12:54:21 db2svr corosync[4142]:   [QUORUM] Members[1]: 2
Oct 23 12:54:21 db2svr corosync[4142]:   [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1
Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN
Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1
...

Googling the message "[TOTEM ] A processor failed, forming new
configuration." I learned that it means node2 could no longer see node1
and therefore fenced it. On node1 I see these messages:

Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing
/etc/init.d/httpd status
Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started.
Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd"
swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"]
(re)start
Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset
Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu
Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64
(mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214
(Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011

At 12:50 rgmanager was still checking the service, and the next entries,
at 12:56, show the node booting again. What makes it worse is that the
clocks on the two servers are not in sync, so I can't compare the logs
directly. The current time difference between the servers is around
5 minutes.
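
A minimal sketch of how the two logs could be put on one timeline once
the clock offset is known. Everything here is an assumption made for
illustration: the function names are hypothetical, the year (2012) is
supplied by hand because syslog timestamps omit it, and both the size
and the sign of the offset (+5 minutes applied to node1) are guesses.
The offset measured today may also differ from the offset at the time
of the failover.

```python
from datetime import datetime, timedelta

# Classic syslog timestamps look like "Oct 23 12:54:19" (15 characters, no year).
STAMP_LEN = 15


def parse_line(line, year):
    """Return (timestamp, message) or None if the line has no leading timestamp."""
    stamp, rest = line[:STAMP_LEN], line[STAMP_LEN:]
    try:
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None
    return ts, rest.strip()


def shift_log(lines, offset, year=2012):
    """Parse syslog lines and add `offset` to every timestamp.

    Lines without a timestamp (wrapped continuations, "...") are skipped.
    """
    shifted = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is not None:
            ts, msg = parsed
            shifted.append((ts + offset, msg))
    return shifted


def merge_logs(*logs):
    """Merge already-parsed logs into one list ordered by timestamp."""
    merged = [entry for log in logs for entry in log]
    return sorted(merged, key=lambda entry: entry[0])


if __name__ == "__main__":
    node1 = ["Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing"]
    node2 = ["Oct 23 12:54:19 db2svr corosync[4142]:   [TOTEM ] A processor failed,"]
    # Assumed: node1's clock is 5 minutes behind node2's.
    timeline = merge_logs(
        shift_log(node1, timedelta(minutes=5)),
        shift_log(node2, timedelta(0)),
    )
    for ts, msg in timeline:
        print(ts, msg)
```

If the sign of the offset is unknown, running the merge once with +5
and once with -5 minutes and checking which ordering is plausible
(node1 going silent before node2 reports the failure) may help decide.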

Where should I look for the cause of this failover? I plan to graph the
sar data today to see whether there was a bottleneck on CPU etc. that
kept node1 from sending its status to node2. But if there was no
bottleneck on CPU, RAM, etc., where should I look for the root cause?
Thank you.
Regards,





-- 
Muhammad Panji
http://www.panji.web.id
http://www.kurungsiku.com



