[Linux-cluster] Fencing node automatically...

Sat Dec 31 04:39:03 UTC 2011

On 12/30/2011 10:40 PM, SATHYA - IT wrote:
> Hi,
> 
> Herewith attaching the logs and configuration files for ref. Kindly assist.
> 
> Thanks

=================
Dec 25 11:11:26 filesrv1 corosync[9061]:   [TOTEM ] A processor failed,
forming new configuration.
Dec 25 11:11:26 filesrv1 kernel: bnx2 0000:03:00.1: eth3: NIC Copper
Link is Down
Dec 25 11:11:26 filesrv1 kernel: bonding: bond1: link status definitely
down for interface eth3, disabling it
Dec 25 11:11:26 filesrv1 kernel: bonding: bond1: making interface eth4
the new active one.
Dec 25 11:11:27 filesrv1 kernel: bnx2 0000:04:00.0: eth4: NIC Copper
Link is Down
Dec 25 11:11:27 filesrv1 kernel: bonding: bond1: link status definitely
down for interface eth4, disabling it
Dec 25 11:11:27 filesrv1 kernel: bonding: bond1: now running without any
active interface !
Dec 25 11:11:28 filesrv1 corosync[9061]:   [QUORUM] Members[1]: 1
Dec 25 11:11:28 filesrv1 corosync[9061]:   [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Dec 25 11:11:28 filesrv1 rgmanager[12538]: State change: clustsrv2 DOWN
Dec 25 11:11:28 filesrv1 corosync[9061]:   [CPG   ] chosen downlist:
sender r(0) ip(10.0.0.10) ; members(old:2 left:1)
Dec 25 11:11:28 filesrv1 corosync[9061]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Dec 25 11:11:28 filesrv1 kernel: dlm: closing connection to node 2
Dec 25 11:11:28 filesrv1 kernel: GFS2: fsid=samba:ctdb.1: jid=0: Trying
to acquire journal lock...
Dec 25 11:11:28 filesrv1 kernel: GFS2: fsid=samba:gen01.1: jid=0: Trying
to acquire journal lock...
Dec 25 11:11:28 filesrv1 fenced[9120]: fencing node clustsrv2
=================

Do you have the servers directly connected to one another? I don't see
the fence message until a full 2 seconds after the link dropped.
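If you want to measure that delay rather than eyeball it, here is a minimal Python sketch. It assumes each syslog entry sits on a single line (the excerpts above are wrapped by the mail client) and that the year is 2011, since syslog timestamps carry none. The helper names are made up for illustration; the sample lines are the ones quoted above, unwrapped.

```python
from datetime import datetime

ASSUMED_YEAR = 2011  # assumption: syslog omits the year; taken from the thread date


def stamp(line, year=ASSUMED_YEAR):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of one line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line containing needle, or None."""
    for line in lines:
        if needle in line:
            return stamp(line)
    return None


def link_to_fence_delay(lines):
    """Seconds between the first 'Link is Down' and the first 'fencing node'."""
    down = first_match(lines, "Link is Down")
    fence = first_match(lines, "fencing node")
    if down is None or fence is None:
        return None
    return (fence - down).total_seconds()


sample = [
    "Dec 25 11:11:26 filesrv1 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down",
    "Dec 25 11:11:27 filesrv1 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down",
    "Dec 25 11:11:28 filesrv1 fenced[9120]: fencing node clustsrv2",
]
print(link_to_fence_delay(sample))
```

Syslog only has one-second resolution, so treat the result as approximate.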

=================
Dec 25 03:30:06 filesrv2 kernel: imklog 4.6.2, log source = /proc/kmsg
started.
Dec 25 03:30:06 filesrv2 rsyslogd: [origin software="rsyslogd"
swVersion="4.6.2" x-pid="8660" x-info="http://www.rsyslog.com"] (re)start
Dec 25 11:14:56 filesrv2 kernel: imklog 4.6.2, log source = /proc/kmsg
started.
Dec 25 11:14:56 filesrv2 rsyslogd: [origin software="rsyslogd"
swVersion="4.6.2" x-pid="8811" x-info="http://www.rsyslog.com"] (re)start
Dec 25 11:14:56 filesrv2 kernel: Initializing cgroup subsys cpuset
Dec 25 11:14:56 filesrv2 kernel: Initializing cgroup subsys cpu
Dec 25 11:14:56 filesrv2 kernel: Linux version 2.6.32-220.el6.x86_64
(mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214
(Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011
Dec 25 11:14:56 filesrv2 kernel: Command line: ro
root=/dev/mapper/vg_filesrv2-LogVol01 rd_LVM_LV=vg_filesrv2/LogVol01
rd_LVM_LV=vg_filesrv2/LogVol00 rd_NO_LUKS rd_NO_MD rd_NO_DM
LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us
crashkernel=128M rhgb quiet acpi=off
Dec 25 11:14:56 filesrv2 kernel: KERNEL supported cpus:
=================

If this node failed, it failed hard, as nothing was written to its logs.
Normally with network issues, you would expect to see "failed -> fence
-> network down" on the survivor and at least some portion of this on
the victim. That it just flat out died tells me that something else took
out the lost server, and that what you saw from the cluster was the
result of recovering from that loss.
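One rough way to spot this kind of silence is to look for the longest gap between consecutive entries in the victim's log. The sketch below is only that, a sketch: it assumes one entry per line and the year 2011, the function names are hypothetical, and the sample lines are the filesrv2 entries quoted above, unwrapped. A long gap that ends at a boot banner is consistent with a hard failure, but a quiet node may simply have had nothing to log, so take it as a hint and not proof.

```python
from datetime import datetime

ASSUMED_YEAR = 2011  # assumption: syslog omits the year; taken from the thread date


def stamp(line, year=ASSUMED_YEAR):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of one line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the longest silence."""
    best = None
    prev_line = None
    prev_time = None
    for line in lines:
        try:
            now = stamp(line)
        except ValueError:
            continue  # skip wrapped or otherwise unparseable lines
        if prev_time is not None:
            gap = (now - prev_time).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_line, prev_time = line, now
    return best


sample = [
    "Dec 25 03:30:06 filesrv2 kernel: imklog 4.6.2, log source = /proc/kmsg started.",
    "Dec 25 11:14:56 filesrv2 kernel: imklog 4.6.2, log source = /proc/kmsg started.",
    "Dec 25 11:14:56 filesrv2 kernel: Initializing cgroup subsys cpuset",
]
seconds, before, after = largest_gap(sample)
print(seconds, "seconds of silence before:", after)
```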

I do see a crash in the second machine at 11:41:32 on Dec. 26, but there
doesn't seem to be any corresponding data on the first node. Are the
times in sync?

Lastly, I see the '[TOTEM ] Retransmit List:' bug on the first
server, but not the second one. Are both nodes fully up to date? If they
are, and if you have a RHEL subscription, it might be worth talking to
your support contact.
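To compare how often that message shows up on each node, something like the following Python sketch could help. It assumes the standard syslog layout seen in the excerpts above (the hostname is the fourth whitespace-separated field) and that you point it at whatever files hold your cluster logs; the function names are made up for illustration.

```python
from collections import Counter

MARKER = "[TOTEM ] Retransmit List:"  # message text as quoted in this thread


def retransmits_per_host(lines):
    """Count lines containing MARKER, keyed by the syslog hostname field."""
    counts = Counter()
    for line in lines:
        if MARKER in line:
            fields = line.split()
            if len(fields) > 3:
                counts[fields[3]] += 1  # field 4 is the hostname in syslog
    return counts


def scan(paths):
    """Sum the per-host counts over several log files."""
    total = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            total += retransmits_per_host(handle)
    return total
```

You would call scan() with a list of log paths, for example the messages file copied off each node. A count of zero on one node and many on the other backs up what I saw, but it won't tell you why.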

In short, you seem to have multiple issues. I'm not entirely sure whether
they're related; if they aren't, debugging will be trickier. Go
through both servers' logs (the ones you attached here) and look closely at
these issues. Investigate them and see where that takes you.

Cheers

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron