[Linux-cluster] fencing for no reason that I can see

Tue Sep 11 01:27:06 UTC 2012

Hello,

I have seen this a few times: one node stops seeing the other node
for no reason I can identify and fences it.  Any idea how I can debug
this?  Here are the logs from the node doing the fencing:

Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] A processor failed, forming new configuration.
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [QUORUM] Members[1]: 1
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change: omadvnfs01b.sec.jel.lc DOWN
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [CPG   ] chosen downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc

And here are the logs from the fenced node:

Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb: read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of File)
Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb: read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of File)
Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source = /proc/kmsg started.
Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start

I did notice that the two nodes' clocks were about 40 seconds apart.
I just fixed that, but what else can I look for here?  Our monitoring
started noticing at 19:02:30 that the fenced node was off the grid,
which is a little after it was fenced.  What test is performed to see
if the other node is up?  How many times does it try?
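
Because the clocks were off, comparing the two logs by eye is
error-prone.  Below is a minimal sketch of how the excerpts could be
merged into one timeline.  The function names (parse_syslog,
merge_logs), the skew_b_seconds parameter, and the sign of the
40-second offset are all assumptions made for illustration; the year
is taken from the date of this post, since syslog stamps omit it.

```python
from datetime import datetime, timedelta


def parse_syslog(line, year=2012):
    """Split a classic syslog line into (timestamp, host, message)."""
    # The stamp "Sep 10 19:01:25" is always 15 characters and has no year.
    # Note: %b depends on the locale; this assumes English month names.
    stamp, rest = line[:15], line[16:]
    ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    host, _, message = rest.partition(" ")
    return ts, host, message


def merge_logs(lines_a, lines_b, skew_b_seconds=0, year=2012):
    """Merge two nodes' logs into one timeline on node a's clock.

    skew_b_seconds is how far node b's clock is assumed to run ahead
    of node a's (use a negative value if it runs behind).
    """
    skew = timedelta(seconds=skew_b_seconds)
    events = [parse_syslog(line, year) for line in lines_a]
    for line in lines_b:
        ts, host, message = parse_syslog(line, year)
        events.append((ts - skew, host, message))
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    node_a = [
        "Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc",
    ]
    node_b = [
        "Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source = /proc/kmsg started.",
    ]
    # 40 is the rough offset I observed; which node was ahead is a guess here.
    for ts, host, message in merge_logs(node_a, node_b, skew_b_seconds=40):
        print(ts.isoformat(), host, message)
```

Flipping the sign of skew_b_seconds shows how much the ordering of
events near 19:01 to 19:05 depends on which clock was ahead.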

Thanks!